pith. machine review for the scientific record.

arxiv: 2409.02977 · v2 · pith:ONRDOB34new · submitted 2024-09-04 · 💻 cs.SE · cs.AI

Large Language Model-Based Agents for Software Engineering: A Survey

Pith reviewed 2026-05-17 12:29 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM-based agents, Software Engineering, Survey, Large Language Models, AI Agents, Multi-agent Systems, Software Development

The pith

This survey gathers 124 papers on LLM-based agents for software engineering and sorts them by software engineering tasks and agent structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish a clear map of the emerging field of LLM-based agents applied to software engineering. It does this by collecting 124 papers and dividing them according to two viewpoints: one focused on the software engineering activities involved and the other on the design and capabilities of the agents. A sympathetic reader would care because these agents go beyond plain language models by adding perception and tool use, which opens new ways to handle complex real-world development issues through collaboration between agents and humans. The survey wraps up by pointing out open challenges and possible next steps.

Core claim

The recent advance in Large Language Models (LLMs) has shaped a new paradigm of AI agents, i.e., LLM-based agents. Compared to standalone LLMs, LLM-based agents substantially extend the versatility and expertise of LLMs by enhancing LLMs with the capabilities of perceiving and utilizing external resources and tools. To date, LLM-based agents have been applied and shown remarkable effectiveness in Software Engineering (SE). The synergy between multiple agents and human interaction brings further promise in tackling complex real-world SE problems. In this work, we present a comprehensive and systematic survey on LLM-based agents for SE. We collect 124 papers and categorize them from two perspectives, i.e., the SE and agent perspectives.

What carries the argument

The two-perspective categorization system that organizes papers according to software engineering tasks on one side and agent architectures and interactions on the other.
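
One way to picture this two-perspective scheme is as a pair of indexes built over the same paper records. The sketch below is illustrative only: the `Paper` class, its field names, the `build_indexes` helper, and the placeholder entries are assumptions made for this example, not the survey's actual data or tooling.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Paper:
    """A hypothetical record for one surveyed paper."""
    title: str
    se_tasks: frozenset = field(default_factory=frozenset)
    agent_components: frozenset = field(default_factory=frozenset)


def build_indexes(papers):
    """Return (by_se_task, by_agent_component), each mapping a label to titles."""
    by_task = defaultdict(list)
    by_component = defaultdict(list)
    for paper in papers:
        for task in sorted(paper.se_tasks):
            by_task[task].append(paper.title)
        for component in sorted(paper.agent_components):
            by_component[component].append(paper.title)
    return dict(by_task), dict(by_component)


# Placeholder entries, not real papers from the survey.
papers = [
    Paper("placeholder-A", frozenset({"coding"}), frozenset({"planning", "tool use"})),
    Paper("placeholder-B", frozenset({"testing"}), frozenset({"tool use"})),
]
by_task, by_component = build_indexes(papers)
print(by_task["coding"])
print(by_component["tool use"])
```

Keeping a single record per paper and deriving both views from it means a paper can appear under several labels in each perspective without being duplicated.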

If this is right

  • Developers gain a structured way to find relevant work on using agents for specific SE activities like coding or testing.
  • Readers gain insight into how agent collaboration and human-in-the-loop setups can address more difficult problems in software development.
  • The identified gaps point toward research on improving agent reliability and integration with existing SE tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such a categorization might help in creating taxonomies that could be applied to LLM agents in other engineering domains beyond software.
  • Future surveys could track how the field evolves by updating the paper list and reapplying the same perspectives.
  • The emphasis on external resources and tools suggests potential for agents that integrate with version control systems or testing frameworks in novel ways.

Load-bearing premise

The 124 papers collected represent the main body of work in this area without major omissions, and the chosen categorization from the SE and agent perspectives covers the key distinctions without significant overlaps or missing categories.

What would settle it

A review of recent publications that reveals many important papers on LLM-based agents in software engineering that were not included in the survey or that do not align well with either the SE or agent perspective categories.

Original abstract

The recent advance in Large Language Models (LLMs) has shaped a new paradigm of AI agents, i.e., LLM-based agents. Compared to standalone LLMs, LLM-based agents substantially extend the versatility and expertise of LLMs by enhancing LLMs with the capabilities of perceiving and utilizing external resources and tools. To date, LLM-based agents have been applied and shown remarkable effectiveness in Software Engineering (SE). The synergy between multiple agents and human interaction brings further promise in tackling complex real-world SE problems. In this work, we present a comprehensive and systematic survey on LLM-based agents for SE. We collect 124 papers and categorize them from two perspectives, i.e., the SE and agent perspectives. In addition, we discuss open challenges and future directions in this critical domain. The repository of this survey is at https://github.com/FudanSELab/Agent4SE-Paper-List.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper presents a survey on LLM-based agents for Software Engineering. It collects 124 papers from the literature and categorizes them using two perspectives: an SE perspective (covering tasks such as requirements, design, coding, testing, and maintenance) and an agent perspective (covering components such as perception, planning, memory, and tool use, along with multi-agent and human-in-the-loop setups). The survey also identifies open challenges and outlines future directions, accompanied by a public GitHub repository listing the papers.

Significance. If the paper collection is shown to be representative and the dual categorization is applied consistently without major gaps or overlaps, the survey would provide a useful map of an emerging interdisciplinary area. The public repository strengthens reproducibility and allows the community to extend the list. However, the overall significance is limited by the absence of a documented, reproducible selection protocol, which is a standard requirement for systematic surveys in this field.

major comments (2)
  1. [Section 2] Collection methodology (Section 2): The claim of a 'comprehensive and systematic survey' rests on the collection of 124 papers, yet no search strings, databases (arXiv, ACM DL, IEEE Xplore, etc.), date range, or inclusion/exclusion criteria are stated. This omission prevents verification that the sample is representative and free of venue or temporal bias.
  2. [Section 4] Categorization framework (Section 4): The two-perspective taxonomy is presented as the core organizational device, but the manuscript provides no explicit discussion or examples of how papers that span multiple SE tasks and agent features are assigned, nor any check for category overlap or unclassified work. Without such validation, the taxonomy's completeness and non-redundancy cannot be assessed.
minor comments (3)
  1. [Abstract] The abstract would benefit from a single sentence stating the time window of the literature search.
  2. [Figure 1] Figure 1 (or the taxonomy diagram) should include a small number of concrete paper examples placed in each leaf category to illustrate classification decisions.
  3. [Repository] The GitHub repository is a clear asset; adding a last-updated date and a brief description of how new papers will be incorporated would further improve its utility.
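
The validation requested in major comment 2 could be approximated by a small script that flags unclassified papers, papers placed in more than one category, and labels that fall outside the taxonomy. This is a minimal sketch under stated assumptions: the `audit_taxonomy` function, the paper identifiers, and the sample assignments are hypothetical and do not reflect the authors' process or data.

```python
def audit_taxonomy(assignments, known_categories):
    """Flag unclassified, multi-category, and unknown-category assignments.

    assignments: dict mapping a paper id to a set of category labels.
    known_categories: set of labels the taxonomy defines.
    """
    unclassified = sorted(p for p, cats in assignments.items() if not cats)
    overlapping = sorted(p for p, cats in assignments.items() if len(cats) > 1)
    unknown = sorted(p for p, cats in assignments.items() if cats - known_categories)
    return {
        "unclassified": unclassified,
        "overlapping": overlapping,
        "unknown": unknown,
    }


# Hypothetical data for illustration only.
categories = {"requirements", "design", "coding", "testing", "maintenance"}
assignments = {
    "p1": {"coding"},
    "p2": {"coding", "testing"},
    "p3": set(),
    "p4": {"deployment"},
}
report = audit_taxonomy(assignments, categories)
print(report)
```

An empty report would support a claim of completeness and non-redundancy; a non-empty one shows exactly which papers need an explicit assignment rule.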

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address each major comment below and will revise the manuscript to enhance methodological transparency and taxonomy clarity.

Point-by-point responses
  1. Referee: [Section 2] Collection methodology (Section 2): The claim of a 'comprehensive and systematic survey' rests on the collection of 124 papers, yet no search strings, databases (arXiv, ACM DL, IEEE Xplore, etc.), date range, or inclusion/exclusion criteria are stated. This omission prevents verification that the sample is representative and free of venue or temporal bias.

    Authors: We acknowledge that the current manuscript does not provide an explicit description of the collection protocol in Section 2. In the revision, we will add a dedicated subsection detailing the search process: databases queried include arXiv, Google Scholar, ACM Digital Library, and IEEE Xplore; search strings combined terms such as 'LLM-based agent' with SE task keywords (e.g., 'requirements engineering', 'code generation', 'testing'); the time range covers January 2022 to August 2024 to capture the post-ChatGPT emergence of the topic; and inclusion criteria require papers to propose, implement, or evaluate LLM agents for concrete SE tasks, while excluding standalone LLM studies without agent or SE focus and non-English publications. This addition will allow independent verification of representativeness. The existing public GitHub repository will be updated with the full search log and paper metadata to support reproducibility. revision: yes

  2. Referee: [Section 4] Categorization framework (Section 4): The two-perspective taxonomy is presented as the core organizational device, but the manuscript provides no explicit discussion or examples of how papers that span multiple SE tasks and agent features are assigned, nor any check for category overlap or unclassified work. Without such validation, the taxonomy's completeness and non-redundancy cannot be assessed.

    Authors: We agree that the manuscript would benefit from explicit guidance on taxonomy application. We will insert a new paragraph in Section 4 describing the assignment rules: each paper is classified by its primary SE task (determined by the core empirical contribution) and primary agent component (e.g., planning when reasoning chains dominate), with secondary aspects noted via cross-references or table footnotes. We will provide three concrete examples of multi-category papers and explain their placement. We will also state that all 124 collected papers fit within the taxonomy after review, with no unclassified items, and briefly discuss how the hierarchical structure reduces overlap. These additions will allow readers to evaluate completeness and non-redundancy. revision: yes
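
The collection protocol described in the first response can be sketched in a few lines. The example is a hedged illustration: the function names, the record fields, the generic `AND` query syntax, and the sample records are assumptions; only the quoted search terms and the January 2022 to August 2024 window come from the simulated rebuttal.

```python
from datetime import date
from itertools import product

# Only "LLM-based agent" and the three task keywords are quoted in the rebuttal.
AGENT_TERMS = ["LLM-based agent"]
SE_KEYWORDS = ["requirements engineering", "code generation", "testing"]
# Stated window: January 2022 to August 2024.
WINDOW_START, WINDOW_END = date(2022, 1, 1), date(2024, 8, 31)


def build_queries(agent_terms, se_keywords):
    """Pair every agent term with every SE keyword (generic boolean syntax, assumed)."""
    return [f'"{a}" AND "{k}"' for a, k in product(agent_terms, se_keywords)]


def is_included(record):
    """Apply the stated inclusion and exclusion criteria to a hypothetical record."""
    return (
        WINDOW_START <= record["published"] <= WINDOW_END
        and record["language"] == "en"
        and record["has_agent"]
        and record["has_se_task"]
    )


# Made-up records for illustration; none corresponds to a real paper.
records = [
    {"id": "r1", "published": date(2023, 6, 1), "language": "en",
     "has_agent": True, "has_se_task": True},
    {"id": "r2", "published": date(2021, 12, 31), "language": "en",
     "has_agent": True, "has_se_task": True},
    {"id": "r3", "published": date(2024, 2, 1), "language": "en",
     "has_agent": False, "has_se_task": True},
]
queries = build_queries(AGENT_TERMS, SE_KEYWORDS)
kept = [r["id"] for r in records if is_included(r)]
print(queries)
print(kept)
```

Publishing the query list and the filter alongside the search log would let a reader rerun the selection and compare the result against the 124 collected papers.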

Circularity Check

0 steps flagged

No circularity: survey reports external literature without derivations or self-referential reductions

Full rationale

This is a survey paper that collects 124 external papers from the literature and organizes them under two perspectives (SE and agent). It contains no equations, parameter fittings, predictions, or derivations that could reduce to the paper's own inputs by construction. The central claim of comprehensiveness is a descriptive assertion about the collection process rather than a mathematical or fitted result; no self-citation chain or ansatz is used to justify any quantitative output. Per the guidelines, a self-contained descriptive survey against external benchmarks receives score 0 with no steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Surveys rest primarily on domain assumptions about literature coverage and the utility of the chosen taxonomy.

axioms (1)
  • domain assumption: The 124 papers identified through the authors' search constitute a sufficiently complete and unbiased sample of relevant LLM-agent SE research.
    Stated implicitly by the claim of a 'comprehensive and systematic survey' without detailed search protocol in the abstract.

pith-pipeline@v0.9.0 · 5463 in / 1086 out tokens · 39330 ms · 2026-05-17T12:29:16.894990+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

    cs.SE 2026-04 unverdicted novelty 7.0

    ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.

  2. The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE

    cs.SE 2026-04 unverdicted novelty 7.0

    Software engineering scope expands beyond executable code to semi-executable artifacts best diagnosed by the new six-ring Semi-Executable Stack model.

  3. ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

    cs.SE 2026-04 unverdicted novelty 7.0

    ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...

  4. Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

    cs.SE 2026-04 unverdicted novelty 7.0

    A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.

  5. An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor

    cs.SE 2026-04 unverdicted novelty 7.0

    ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.

  6. FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems

    cs.SE 2026-04 unverdicted novelty 7.0

    FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.

  7. Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure

    cs.SE 2026-04 accept novelty 7.0

    Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...

  8. Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering

    cs.SE 2026-03 unverdicted novelty 7.0

    StackRepoQA shows LLMs reach only moderate accuracy on multi-file Java QA tasks, with gains from graph-based retrieval but frequent reliance on verbatim answer reproduction.

  9. Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents

    cs.LG 2026-03 unverdicted novelty 7.0

    A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.

  10. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  11. Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads

    cs.DC 2026-04 unverdicted novelty 6.0

    Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.

  12. EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents

    cs.SE 2025-11 unverdicted novelty 6.0

    EvoDev introduces an iterative feature-driven framework with a DAG-based Feature Map for context propagation that improves LLM agent performance on end-to-end software development tasks by 56.8% over the best baseline.

  13. The Command Line GUIde: Graphical Interfaces from Man Pages via AI

    cs.HC 2025-10 unverdicted novelty 6.0

    GUIde uses AI to translate man pages into graphical interface specifications for command line tools, evaluated on a corpus of real commands.

  14. Agentless: Demystifying LLM-based Software Engineering Agents

    cs.SE 2024-07 conditional novelty 6.0

    Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.

  15. From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines

    cs.SE 2026-05 unverdicted novelty 5.0

    The central challenge in AI-augmented CI/CD is designing authority transfer from humans to agents under constraints, as current systems remain limited to bounded data-plane autonomy backed by external governance.

  16. Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

  17. Code Semantic Zooming

    cs.HC 2025-10 unverdicted novelty 5.0

    CodeZoom is a pseudocode-based multi-layer abstraction tool that improves developer control and comprehension over LLM code generation compared to direct use of agents like Claude Code.

  18. An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

    cs.SE 2026-04 unverdicted novelty 4.0

    Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.

  19. LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review

    cs.SE 2026-02 unverdicted novelty 3.0

    A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...

Reference graph

Works this paper leans on

290 extracted references · 290 canonical work pages · cited by 19 Pith papers · 14 internal anchors

  1. [1]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models.CoRR, abs/2303.18223, 2023

  2. [2]

    Large language models for software engineering: A systematic literature review.ACM Trans

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review.ACM Trans. Softw. Eng. Methodol., 33(8):220:1– 220:79, 2024

  3. [3]

    Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. Large language models for software engineering: Survey and open problems. In IEEE/ACM International Conference on Software Engineering: Future of Software Engineering, ICSE-FoSE 2023, Melbourne, Australia, May 14-20, 2023, pages 31–53. IEEE, 2023

  4. [4]

    Self-collaboration code generation via chatgpt.ACM Trans

    Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration code generation via chatgpt.ACM Trans. Softw. Eng. Methodol., 33(7):189:1–189:38, 2024

  5. [5]

    Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt.CoRR, abs/2304.10778, 2023

    Burak Yetistiren, Isik ¨Ozsoy, Miray Ayerdem, and Eray T ¨uz ¨un. Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt.CoRR, abs/2304.10778, 2023

  6. [6]

    To- wards enhancing in-context learning for code generation.CoRR, abs/2303.17780, 2023

    Jia Li, Yunfei Zhao, Yongmin Li, Ge Li, and Zhi Jin. To- wards enhancing in-context learning for code generation.CoRR, abs/2303.17780, 2023

  7. [7]

    STALL+: boosting llm-based repository-level code comple- tion with static analysis.CoRR, abs/2406.10018, 2024

    Junwei Liu, Yixuan Chen, Mingwei Liu, Xin Peng, and Yiling Lou. STALL+: boosting llm-based repository-level code comple- tion with static analysis.CoRR, abs/2406.10018, 2024. SEPTEMBER 2024 48

  8. [8]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Ne...

  9. [9]

    Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language mod- els

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language mod- els. In Ren ´e Just and Gordon Fraser, editors,Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2023, Seattle, WA, USA, July...

  10. [10]

    Software testing with large language models: Survey, landscape, and vision.IEEE Trans

    Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. Software testing with large language models: Survey, landscape, and vision.IEEE Trans. Software Eng., 50(4):911–936, 2024

  11. [11]

    Lahiri, and Siddhartha Sen

    Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 919–931. IEEE, 2023

  12. [12]

    Less training, more repairing please: Revisiting automated program repair via zero- shot learning

    Chunqiu Steven Xia and Lingming Zhang. Less training, more repairing please: Revisiting automated program repair via zero- shot learning. In Abhik Roychoudhury, Cristian Cadar, and Miryung Kim, editors,Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Founda- tions of Software Engineering, ESEC/FSE 2022, Singa...

  13. [13]

    A quantitative and qualitative evaluation of llm-based explainable fault localization

    Sungmin Kang, Gabin An, and Shin Yoo. A quantitative and qualitative evaluation of llm-based explainable fault localization. Proc. ACM Softw. Eng., 1(FSE):1424–1446, 2024

  14. [14]

    Repair is nearly generation: Multilingual program repair with llms

    Harshit Joshi, Jos ´e Pablo Cambronero S ´anchez, Sumit Gulwani, Vu Le, Gust Verbruggen, and Ivan Radicek. Repair is nearly generation: Multilingual program repair with llms. In Brian Williams, Yiling Chen, and Jennifer Neville, editors,Thirty- Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Application...

  15. [15]

    Prompting is all you need: Automated android bug replay with large language models

    Sidong Feng and Chunyang Chen. Prompting is all you need: Automated android bug replay with large language models. InProceedings of the 46th IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April 14-20, 2024, pages 67:1–67:13. ACM, 2024

  16. [16]

    Auto- mated program repair in the era of large pre-trained language models

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Auto- mated program repair in the era of large pre-trained language models. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 1482–1494. IEEE, 2023

  17. [17]

    Impact of code language models on automated program repair

    Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. Impact of code language models on automated program repair. In45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023, pages 1430–1442. IEEE, 2023

  18. [18]

    Benchmarking and enhancing LLM agents in localizing linux kernel bugs.CoRR, abs/2505.19489, 2025

    Zhenhao Zhou, Zhuochen Huang, Yike He, Chong Wang, Jiajun Wang, Yijian Wu, Xin Peng, and Yiling Lou. Benchmarking and enhancing LLM agents in localizing linux kernel bugs.CoRR, abs/2505.19489, 2025

  19. [20]

    Gardner, Yiming Yang, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, and Amir Yazdan- bakhsh

    Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob R. Gardner, Yiming Yang, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, and Amir Yazdan- bakhsh. Learning performance-improving code edits. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

  20. [21]

    Ai-assisted coding: Experiments with GPT-4.CoRR, abs/2304.13187, 2023

    Russell A Poldrack, Thomas Lu, and Gasper Begus. Ai-assisted coding: Experiments with GPT-4.CoRR, abs/2304.13187, 2023

  21. [22]

    Llm com- piler: Foundation language models for compiler optimization

    Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather. Llm com- piler: Foundation language models for compiler optimization. In Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction, CC ’25, page 141–153, New York, NY, USA,

  22. [23]

    Association for Computing Machinery

  23. [24]

    TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment

    Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, and Yiling Lou. TRANSAGENT: an llm-based multi-agent sys- tem for code translation.CoRR, abs/2409.19894, 2024

  24. [25]

    The rise and potential of large language model based agents: A survey.Sci

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, Qi Zhang, and Tao Gui. Th...

  25. [26]

    Carlos H. C. Ribeiro. Reinforcement learning agents.Artif. Intell. Rev., 17(3):223–250, 2002

  26. [27]

    Littman, and Andrew W

    Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey.J. Artif. Intell. Res., 4:237–285, 1996

  27. [28]

    Steps toward artificial intelligence.Proceedings of the IRE, 49(1):8–30, 1961

    Marvin Minsky. Steps toward artificial intelligence.Proceedings of the IRE, 49(1):8–30, 1961

  28. [29]

    Shelton, Michael J

    Charles Lee Isbell Jr., Christian R. Shelton, Michael J. Kearns, Satinder Singh, and Peter Stone. A social reinforcement learning agent. In Elisabeth Andr ´e, Sandip Sen, Claude Frasson, and J¨org P . M¨uller, editors,Proceedings of the Fifth International Con- ference on Autonomous Agents, AGENTS 2001, Montreal, Canada, May 28 - June 1, 2001, pages 377–3...

  29. [30]

    A survey on large language model based autonomous agents.Frontiers Comput

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents.Frontiers Comput. Sci., 18(6):186345, 2024

  30. [31]

    A survey on the memory mechanism of large language model based agents.ACM Trans

    Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model based agents.ACM Trans. Inf. Syst., July 2025. Just Accepted

  31. [32]

    1968, Brussels, Scientific Affairs Division, NATO

    Peter Naur and Brian Randell.Software Engineering: Report of a conference sponsored by the NATO Science Committee, Garmisch, Germany, 7-11 Oct. 1968, Brussels, Scientific Affairs Division, NATO. 1969

  32. [33]

    Dictionary of Computer Science, Engineering and Technology

    Philip A Laplante, Naoufel Werghi, Christopher Lee Kuszmavl, Chris Verhof, Brian Henderson-Sellers, Joseph L Ganley, Ian Sommerville, Amos R Omondi, Ling Guan, Marco Gori, et al. Dictionary of Computer Science, Engineering and Technology. CRC Press, 2017

  33. [34]

    Barry W. Boehm. A view of 20th and 21st century software engineering. In Leon J. Osterweil, H. Dieter Rombach, and Mary Lou Soffa, editors,28th International Conference on Software Engineering (ICSE 2006), Shanghai, China, May 20-28, 2006, pages 12–29. ACM, 2006

  34. [35]

    Chawla, Olaf Wiest, and Xiangliang Zhang

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V . Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pages 8048–8057. ij...

  35. [36]

    Exploring large language model based intelligent agents: Definitions, methods, and prospects

    Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang, Xiangrui Meng, Sirui Hong, Wenhao Li, Zihao Wang, Zekai Wang, Feng Yin, Junhua Zhao, and Xiuqiang He. Exploring large language model based intelligent agents: Definitions, methods, and prospects. CoRR, abs/2401.03428, 2024

  36. [37]

    Augmented language models: A survey.Trans

    Gr ´egoire Mialon, Roberto Dess `ı, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Bap- tiste Rozi `ere, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. Augmented language models: A survey.Trans. Mach. Learn. Res., 2023, 2023

  37. [38]

    Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents.J

    Yue Liu, Sin Kit Lo, Qinghua Lu, Liming Zhu, Dehai Zhao, Xiwei Xu, Stefan Harrer, and Jon Whittle. Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents.J. Syst. Softw., 220:112278, 2025

  38. [39]

    Exploring autonomous agents through the lens of large language models: A review.CoRR, abs/2404.04442, 2024

    Saikat Barua. Exploring autonomous agents through the lens of large language models: A review.CoRR, abs/2404.04442, 2024

  39. [40]

    A survey on large language models for code generation

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. ACM Trans. Softw. Eng. Methodol., July 2025. Just Accepted

  40. [41]

    Llm-based multi- agent systems for software engineering: Literature review, vision, SEPTEMBER 2024 49 and the road ahead.ACM Trans

    Junda He, Christoph Treude, and David Lo. Llm-based multi- agent systems for software engineering: Literature review, vision, SEPTEMBER 2024 49 and the road ahead.ACM Trans. Softw. Eng. Methodol., 34(5), May 2025

  41. [42]

    Zhang, Max Hort, Mark Harman, and Federica Sarro

    Zhenpeng Chen, Jie M. Zhang, Max Hort, Mark Harman, and Federica Sarro. Fairness testing: A comprehensive survey and analysis of trends.ACM Trans. Softw. Eng. Methodol., 33(5):137:1– 137:59, 2024

  42. [43]

    A survey of compiler testing

    Junjie Chen, Jibesh Patra, Michael Pradel, Yingfei Xiong, Hongyu Zhang, Dan Hao, and Lu Zhang. A survey of compiler testing. ACM Comput. Surv., 53(1):4:1–4:36, 2021

  43. [44]

    Finding trends in software research

    George Mathew, Amritanshu Agrawal, and Tim Menzies. Finding trends in software research. IEEE Trans. Software Eng., 49(4):1397–1410, 2023

  44. [45]

    Empirical research in software engineering - A literature survey

    Li Zhang, Jia-Hao Tian, Jing Jiang, Yi-Jun Liu, Meng-Yuan Pu, and Tao Yue. Empirical research in software engineering - A literature survey. J. Comput. Sci. Technol., 33(5):876–899, 2018

  45. [46]

    Machine learning testing: Survey, landscapes and horizons

    Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. Machine learning testing: Survey, landscapes and horizons. IEEE Trans. Software Eng., 48(2):1–36, 2022

  46. [47]

    https://dblp.org, 2024

    DBLP. https://dblp.org, 2024

  47. [48]

    https://blog.dblp.org/2024/01/01/7-million-publications/, 2024

    7 million publications. https://blog.dblp.org/2024/01/01/7-million-publications/, 2024

  48. [49]

    https://arxiv.org/, 2024

    arXiv. https://arxiv.org/, 2024

  49. [50]

    Opinion mining for software development: A systematic literature review

    Bin Lin, Nathan Cassee, Alexander Serebrenik, Gabriele Bavota, Nicole Novielli, and Michele Lanza. Opinion mining for software development: A systematic literature review. ACM Trans. Softw. Eng. Methodol., 31(3):38:1–38:41, 2022

  50. [51]

    Requirements Engineering: An Overview

    Klaus Pohl. Requirements Engineering: An Overview. RWTH, Fachgruppe Informatik Aachen, 1996

  51. [52]

    Requirements engineering: A roadmap

    Bashar Nuseibeh and Steve M. Easterbrook. Requirements engineering: A roadmap. In Anthony Finkelstein, editor, 22nd International Conference on Software Engineering, Future of Software Engineering Track, ICSE 2000, Limerick Ireland, June 4-11, 2000, pages 35–46. ACM, 2000

  52. [53]

    Requirements engineering: A survey

    Vivek Shukla, Dhirendra Pandey, and Raj Shree. Requirements engineering: A survey. Communications on Applied Electronics, 3(5):28–31, 2015

  53. [54]

    The unified modeling language

    Grady Booch, Ivar Jacobson, James Rumbaugh, et al. The unified modeling language. Unix Review, 14(13):5, 1996

  54. [55]

    Entity-relationship-attribute designs and sketches

    Michael Johnson, Robert Rosebrugh, and RJ Wood. Entity-relationship-attribute designs and sketches. Theory and Applications of Categories, 10(3):94–112, 2002

  55. [56]

    Uncovering quality-attribute concerns in use case specifications via early aspect mining

    Alejandro Rago, Claudia A. Marcos, and J. Andrés Díaz Pace. Uncovering quality-attribute concerns in use case specifications via early aspect mining. Requir. Eng., 18(1):67–84, 2013

  56. [57]

    The applications of natural language processing (NLP) for software requirement engineering - A systematic literature review

    Farhana Nazir, Wasi Haider Butt, Muhammad Waseem Anwar, and Muazzam Ali Khan Khattak. The applications of natural language processing (NLP) for software requirement engineering - A systematic literature review. In Kuinam Kim and Nikolai Joukov, editors, Information Science and Applications 2017 - ICISA 2017, Macau, China, 20-23 March 2017, volume 424 of Lec...

  57. [58]

    Automatically classifying user requests in crowdsourcing requirements engineering

    Chuanyi Li, Liguo Huang, Jidong Ge, Bin Luo, and Vincent Ng. Automatically classifying user requests in crowdsourcing requirements engineering. J. Syst. Softw., 138:108–123, 2018

  58. [59]

    Advances in automated support for requirements engineering: A systematic literature review

    Muhammad Aminu Umar and Kevin Lano. Advances in automated support for requirements engineering: A systematic literature review. Requir. Eng., 29(2):177–207, 2024

  59. [60]

    PRCBERT: prompt learning for requirement classification using bert-based pretrained language models

    Xianchang Luo, Yinxing Xue, Zhenchang Xing, and Jiamou Sun. PRCBERT: prompt learning for requirement classification using bert-based pretrained language models. In 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022, Rochester, MI, USA, October 10-14, 2022, pages 75:1–75:13. ACM, 2022

  60. [61]

    Using llms in software requirements specifications: An empirical evaluation

    Madhava Krishna, Bhagesh Gaur, Arsh Verma, and Pankaj Jalote. Using llms in software requirements specifications: An empirical evaluation. In Grischa Liebel, Irit Hadar, and Paola Spoletini, editors, 32nd IEEE International Requirements Engineering Conference, RE 2024, Reykjavik, Iceland, June 24-28, 2024, pages 475–483. IEEE, 2024

  61. [62]

    Empirical evaluation of chatgpt on requirements information retrieval under zero-shot setting

    Jianzhang Zhang, Yiyang Chen, Chuang Liu, Nan Niu, and Yinglin Wang. Empirical evaluation of chatgpt on requirements information retrieval under zero-shot setting. In 2023 International Conference on Intelligent Computing and Next Generation Networks (ICNGN), pages 1–6. IEEE, 2023

  62. [63]

    Krishna Ronanki, Beatriz Cabrero Daniel, and Christian Berger. Chatgpt as a tool for user story quality evaluation: Trustworthy out of the box? In Philippe Kruchten and Peggy Gregory, editors, Agile Processes in Software Engineering and Extreme Programming - Workshops - XP 2022 Workshops, Copenhagen, Denmark, June 13-17, 2022, and XP 2023 Workshops, Amste...

  63. [64]

    Improving requirements completeness: Automated assistance through large language models

    Dipeeka Luitel, Shabnam Hassani, and Mehrdad Sabetzadeh. Improving requirements completeness: Automated assistance through large language models. Requir. Eng., 29(1):73–95, 2024

  64. [65]

    Mohammadmehdi Ataei, Hyunmin Cheong, Daniele Grandi, Ye Wang, Nigel Morris, and Alexander Tessier. Elicitron: A large language model agent-based simulation framework for design requirements elicitation. Journal of Computing and Information Science in Engineering, 25(2):021012, 01 2025

  65. [66]

    Specgen: Automated generation of formal program specifications via large language models

    Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. Specgen: Automated generation of formal program specifications via large language models. In 47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025, pages 16–28. IEEE, 2025

  66. [67]

    Advancing Requirements Engineering Through Generative AI: Assessing the Role of LLMs

    Chetan Arora, John Grundy, and Mohamed Abdelrazek. Advancing Requirements Engineering Through Generative AI: Assessing the Role of LLMs, pages 129–148. Springer Nature Switzerland, Cham, 2024

  67. [68]

    MARE: multi-agents collaboration framework for requirements engineering

    Dongming Jin, Zhi Jin, Xiaohong Chen, and Chunhui Wang. MARE: multi-agents collaboration framework for requirements engineering. CoRR, abs/2405.03256, 2024

  68. [69]

    David R. Cok. Openjml: JML for java 7 by extending openjdk. In Mihaela Gheorghiu Bobaru, Klaus Havelund, Gerard J. Holzmann, and Rajeev Joshi, editors, NASA Formal Methods - Third International Symposium, NFM 2011, Pasadena, CA, USA, April 18-20, 2011. Proceedings, volume 6617 of Lecture Notes in Computer Science, pages 472–479. Springer, 2011

  69. [70]

    Houdini, an annotation assistant for esc/java

    Cormac Flanagan and K. Rustan M. Leino. Houdini, an annotation assistant for esc/java. In José Nuno Oliveira and Pamela Zave, editors, FME 2001: Formal Methods for Increasing Software Productivity, International Symposium of Formal Methods Europe, Berlin, Germany, March 12-16, 2001, Proceedings, volume 2021 of Lecture Notes in Computer Science, pages 5...

  70. [71]

    The daikon system for dynamic detection of likely invariants

    Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, Stephen McCamant, Carlos Pacheco, Matthew S. Tschantz, and Chen Xiao. The daikon system for dynamic detection of likely invariants. Sci. Comput. Program., 69(1-3):35–45, 2007

  71. [72]

    Personality-guided code generation using large language models

    Yaoqi Guo, Zhenpeng Chen, Jie M. Zhang, Yang Liu, and Yun Ma. Personality-guided code generation using large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1068–1080, Vienna, Austria, July 2025. Association for Computational Linguistics

  72. [73]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the AI ocean: A survey on hallucination in large language models. CoRR, abs/2309.01219, 2023

  73. [74]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing System...

  74. [75]

    Fully autonomous programming with large language models

    Vadim Liventsev, Anastasiia Grishina, Aki Härmä, and Leon Moonen. Fully autonomous programming with large language models. In Sara Silva and Luís Paquete, editors, Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2023, Lisbon, Portugal, July 15-19, 2023, pages 1146–1155. ACM, 2023

  75. [76]

    Is self-repair a silver bullet for code generation?

    Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Is self-repair a silver bullet for code generation? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

  76. [78]

    Autogen: Enabling next-gen LLM applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024

  77. [79]

    INTERVENOR: prompting the coding ability of large language models with the interactive chain of repair

    Hanbin Wang, Zhenghao Liu, Shuo Wang, Ganqu Cui, Ning Ding, Zhiyuan Liu, and Ge Yu. INTERVENOR: prompting the coding ability of large language models with the interactive chain of repair. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and vi...

  78. [80]

    Test-driven development and llm-based code generation

    Noble Saji Mathews and Meiyappan Nagappan. Test-driven development and llm-based code generation. In Vladimir Filkov, Baishakhi Ray, and Minghui Zhou, editors, Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2024, pages 1583–1594. ACM, 2024

  79. [81]

    Autocoder: Enhancing code large language model with aiev-instruct

    Bin Lei, Yuchen Li, and Qiuwu Chen. Autocoder: Enhancing code large language model with aiev-instruct. CoRR, abs/2405.14906, 2024

  80. [82]

    Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules

    Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, and Shafiq Joty. Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

Showing first 80 references.