arxiv: 2409.02977 · v2 · pith:ONRDOB34 · submitted 2024-09-04 · 💻 cs.SE · cs.AI

Large Language Model-Based Agents for Software Engineering: A Survey

Junwei Liu, Kaixin Wang, Yixuan Chen, Xin Peng, Zhenpeng Chen, Lingming Zhang, Yiling Lou

Pith reviewed 2026-05-17 12:29 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords LLM-based agents · Software Engineering · Survey · Large Language Models · AI Agents · Multi-agent Systems · Software Development


The pith

This survey gathers 124 papers on LLM-based agents for software engineering and sorts them by software engineering tasks and agent structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish a clear map of the emerging field of LLM-based agents applied to software engineering. It does this by collecting 124 papers and dividing them according to two viewpoints: one focused on the software engineering activities involved and the other on the design and capabilities of the agents. A sympathetic reader would care because these agents go beyond plain language models by adding perception and tool use, which opens new ways to handle complex real-world development issues through collaboration between agents and humans. The survey wraps up by pointing out open challenges and possible next steps.

Core claim

The recent advance in Large Language Models (LLMs) has shaped a new paradigm of AI agents, i.e., LLM-based agents. Compared to standalone LLMs, LLM-based agents substantially extend the versatility and expertise of LLMs by enhancing them with the capabilities of perceiving and utilizing external resources and tools. To date, LLM-based agents have been applied to Software Engineering (SE) and have shown remarkable effectiveness. The synergy between multiple agents and human interaction brings further promise in tackling complex real-world SE problems. In this work, we present a comprehensive and systematic survey on LLM-based agents for SE. We collect 124 papers and categorize them from two perspectives, i.e., the SE perspective and the agent perspective.

What carries the argument

The two-perspective categorization system that organizes papers according to software engineering tasks on one side and agent architectures and interactions on the other.

If this is right

Developers gain a structured way to find relevant work on using agents for specific SE activities such as coding or testing.
Researchers gain insight into how agent collaboration and human-in-the-loop setups can address harder software development problems.
The identified gaps point toward research on improving agent reliability and integration with existing SE tools.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such a categorization might serve as a template for taxonomies of LLM agents in engineering domains beyond software.
Future surveys could track how the field evolves by updating the paper list and reapplying the same perspectives.
The emphasis on external resources and tools suggests potential for agents that integrate with version control systems or testing frameworks in novel ways.

Load-bearing premise

The 124 collected papers represent the main body of work in this area without major omissions, and the chosen SE and agent perspectives capture the key distinctions without significant overlap or missing categories.

What would settle it

A review of recent publications that reveals many important papers on LLM-based agents in software engineering that were not included in the survey or that do not align well with either the SE or agent perspective categories.


Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a standard survey that organizes 124 papers on LLM agents for SE into two perspectives but provides little evidence that the collection is complete or that the categories avoid overlap.


The core of this paper is a literature survey that pulls together existing work on LLM-based agents applied to software engineering. They gathered 124 papers and split the discussion into an SE perspective focused on tasks like code generation or testing and an agent perspective focused on architecture, tools, and multi-agent setups. That dual view is the main organizing move they make, along with a section on open challenges and a GitHub repo listing the papers.

Referee Report

2 major / 3 minor

Summary. The paper presents a survey on LLM-based agents for Software Engineering. It collects 124 papers from the literature and categorizes them using two perspectives: an SE perspective (covering tasks such as requirements, design, coding, testing, and maintenance) and an agent perspective (covering components such as perception, planning, memory, and tool use, along with multi-agent and human-in-the-loop setups). The survey also identifies open challenges and outlines future directions, accompanied by a public GitHub repository listing the papers.

Significance. If the paper collection is shown to be representative and the dual categorization is applied consistently without major gaps or overlaps, the survey would provide a useful map of an emerging interdisciplinary area. The public repository strengthens reproducibility and allows the community to extend the list. However, the overall significance is limited by the absence of a documented, reproducible selection protocol, which is a standard requirement for systematic surveys in this field.

major comments (2)

[Section 2] Collection methodology: The claim of a 'comprehensive and systematic survey' rests on the collection of 124 papers, yet no search strings, databases (arXiv, ACM DL, IEEE Xplore, etc.), date range, or inclusion/exclusion criteria are stated. This omission prevents verification that the sample is representative and free of venue or temporal bias.
[Section 4] Categorization framework: The two-perspective taxonomy is presented as the core organizational device, but the manuscript provides no explicit discussion or examples of how papers that span multiple SE tasks and agent features are assigned, nor any check for category overlap or unclassified work. Without such validation, the taxonomy's completeness and non-redundancy cannot be assessed.

minor comments (3)

[Abstract] The abstract would benefit from a single sentence stating the time window of the literature search.
[Figure 1] Figure 1 (or the taxonomy diagram) should include a small number of concrete paper examples placed in each leaf category to illustrate classification decisions.
[Repository] The GitHub repository is a clear asset; adding a last-updated date and a brief description of how new papers will be incorporated would further improve its utility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our survey. We address each major comment below and will revise the manuscript to enhance methodological transparency and taxonomy clarity.


Referee: [Section 2] Collection methodology: The claim of a 'comprehensive and systematic survey' rests on the collection of 124 papers, yet no search strings, databases (arXiv, ACM DL, IEEE Xplore, etc.), date range, or inclusion/exclusion criteria are stated. This omission prevents verification that the sample is representative and free of venue or temporal bias.

Authors: We acknowledge that the current manuscript does not provide an explicit description of the collection protocol in Section 2. In the revision, we will add a dedicated subsection detailing the search process: databases queried include arXiv, Google Scholar, ACM Digital Library, and IEEE Xplore; search strings combined terms such as 'LLM-based agent' with SE task keywords (e.g., 'requirements engineering', 'code generation', 'testing'); the time range covers January 2022 to August 2024 to capture the post-ChatGPT emergence of the topic; and inclusion criteria require papers to propose, implement, or evaluate LLM agents for concrete SE tasks, while excluding standalone LLM studies without agent or SE focus and non-English publications. This addition will allow independent verification of representativeness. The existing public GitHub repository will be updated with the full search log and paper metadata to support reproducibility. revision: yes
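The selection protocol promised in this response can be expressed as a simple screening predicate. The sketch below is a hypothetical illustration, not the authors' tooling: the `Candidate` record, its field names, and the `include` and `screen` functions are assumptions. Only the January 2022 to August 2024 window, the English-language requirement, and the "LLM agent applied to a concrete SE task" criteria come from the response.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Search window stated in the response: January 2022 to August 2024.
WINDOW_START = date(2022, 1, 1)
WINDOW_END = date(2024, 8, 31)


@dataclass
class Candidate:
    """Hypothetical metadata record for one search hit."""
    title: str
    published: date
    language: str           # ISO 639-1 code, e.g. "en"
    uses_llm_agent: bool    # proposes, implements, or evaluates an LLM-based agent
    se_task: Optional[str]  # concrete SE task addressed, or None if there is none


def include(c: Candidate) -> bool:
    """Return True only if the candidate passes every inclusion criterion."""
    return (
        WINDOW_START <= c.published <= WINDOW_END
        and c.language == "en"      # non-English publications are excluded
        and c.uses_llm_agent        # standalone-LLM studies are excluded
        and c.se_task is not None   # work without a concrete SE task is excluded
    )


def screen(candidates: list[Candidate]) -> list[str]:
    """Return the titles of the candidates that survive screening."""
    return [c.title for c in candidates if include(c)]


if __name__ == "__main__":
    hits = [
        Candidate("Agent for test generation", date(2023, 5, 1), "en", True, "testing"),
        Candidate("Prompting study without tools", date(2023, 6, 1), "en", False, "coding"),
        Candidate("Agent paper from 2021", date(2021, 11, 1), "en", True, "repair"),
    ]
    print(screen(hits))  # ['Agent for test generation']
```

Keeping each criterion as a separate, commented clause makes it easy to publish the predicate alongside the search log, so readers can rerun the screening on the same metadata.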
Referee: [Section 4] Categorization framework: The two-perspective taxonomy is presented as the core organizational device, but the manuscript provides no explicit discussion or examples of how papers that span multiple SE tasks and agent features are assigned, nor any check for category overlap or unclassified work. Without such validation, the taxonomy's completeness and non-redundancy cannot be assessed.

Authors: We agree that the manuscript would benefit from explicit guidance on taxonomy application. We will insert a new paragraph in Section 4 describing the assignment rules: each paper is classified by its primary SE task (determined by the core empirical contribution) and primary agent component (e.g., planning when reasoning chains dominate), with secondary aspects noted via cross-references or table footnotes. We will provide three concrete examples of multi-category papers and explain their placement. We will also state that all 124 collected papers fit within the taxonomy after review, with no unclassified items, and briefly discuss how the hierarchical structure reduces overlap. These additions will allow readers to evaluate completeness and non-redundancy. revision: yes
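The assignment rule described in this response (one primary SE task, one primary agent component, secondary aspects as cross-references) amounts to a small data model plus two checks. The sketch below is hypothetical: the label sets are drawn from the categories named in the referee summary, not from the survey's full taxonomy, and the `Paper`, `unclassified`, `index_by_cell`, and `cross_references` names are illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Illustrative leaf labels taken from the referee summary; the real taxonomy is finer-grained.
SE_TASKS = {"requirements", "design", "coding", "testing", "maintenance"}
AGENT_COMPONENTS = {"perception", "planning", "memory", "tool_use"}


@dataclass
class Paper:
    title: str
    primary_task: str       # decided by the core empirical contribution
    primary_component: str  # the dominant agent component
    secondary: set[str] = field(default_factory=set)  # aspects noted only as cross-references


def unclassified(papers: list[Paper]) -> list[str]:
    """Titles whose primary labels fall outside the taxonomy."""
    return [
        p.title
        for p in papers
        if p.primary_task not in SE_TASKS or p.primary_component not in AGENT_COMPONENTS
    ]


def index_by_cell(papers: list[Paper]) -> dict[tuple[str, str], list[str]]:
    """Place each paper in exactly one (task, component) cell, so primary placements never overlap."""
    cells: dict[tuple[str, str], list[str]] = defaultdict(list)
    for p in papers:
        cells[(p.primary_task, p.primary_component)].append(p.title)
    return dict(cells)


def cross_references(papers: list[Paper]) -> dict[str, list[str]]:
    """Map each secondary aspect to the titles of the papers that mention it."""
    refs: dict[str, list[str]] = defaultdict(list)
    for p in papers:
        for aspect in sorted(p.secondary):
            refs[aspect].append(p.title)
    return dict(refs)
```

Under this model, the claim that all 124 papers fit the taxonomy corresponds to `unclassified` returning an empty list, while multi-category papers show up once in `index_by_cell` and again only in `cross_references`.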

Circularity Check

0 steps flagged

No circularity: survey reports external literature without derivations or self-referential reductions


This is a survey paper that collects 124 external papers from the literature and organizes them under two perspectives (SE and agent). It contains no equations, parameter fittings, predictions, or derivations that could reduce to the paper's own inputs by construction. The central claim of comprehensiveness is a descriptive assertion about the collection process rather than a mathematical or fitted result; no self-citation chain or ansatz is used to justify any quantitative output. Per the guidelines, a self-contained descriptive survey against external benchmarks receives score 0 with no steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Surveys rest primarily on domain assumptions about literature coverage and the utility of the chosen taxonomy.

axioms (1)

domain assumption: The 124 papers identified through the authors' search constitute a sufficiently complete and unbiased sample of relevant LLM-agent SE research.
Stated implicitly by the claim of a 'comprehensive and systematic survey' without detailed search protocol in the abstract.

pith-pipeline@v0.9.0 · 5463 in / 1086 out tokens · 39330 ms · 2026-05-17T12:29:16.894990+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
cs.SE 2026-04 unverdicted novelty 7.0

ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE
cs.SE 2026-04 unverdicted novelty 7.0

Software engineering scope expands beyond executable code to semi-executable artifacts best diagnosed by the new six-ring Semi-Executable Stack model.
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
cs.SE 2026-04 unverdicted novelty 7.0

ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...
Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
cs.SE 2026-04 unverdicted novelty 7.0

A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor
cs.SE 2026-04 unverdicted novelty 7.0

ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems
cs.SE 2026-04 unverdicted novelty 7.0

FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
cs.SE 2026-04 accept novelty 7.0

Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...
Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
cs.SE 2026-03 unverdicted novelty 7.0

StackRepoQA shows LLMs reach only moderate accuracy on multi-file Java QA tasks, with gains from graph-based retrieval but frequent reliance on verbatim answer reproduction.
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
cs.LG 2026-03 unverdicted novelty 7.0

A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
Revisiting DAgger in the Era of LLM-Agents
cs.LG 2026-05 conditional novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads
cs.DC 2026-04 unverdicted novelty 6.0

Combining local routing with prompt compression saves 45-79% cloud tokens on edit and explanation workloads, while a fuller set including draft-review saves 51% on RAG-heavy tasks.
EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents
cs.SE 2025-11 unverdicted novelty 6.0

EvoDev introduces an iterative feature-driven framework with a DAG-based Feature Map for context propagation that improves LLM agent performance on end-to-end software development tasks by 56.8% over the best baseline.
The Command Line GUIde: Graphical Interfaces from Man Pages via AI
cs.HC 2025-10 unverdicted novelty 6.0

GUIde uses AI to translate man pages into graphical interface specifications for command line tools, evaluated on a corpus of real commands.
Agentless: Demystifying LLM-based Software Engineering Agents
cs.SE 2024-07 conditional novelty 6.0

Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines
cs.SE 2026-05 unverdicted novelty 5.0

The central challenge in AI-augmented CI/CD is designing authority transfer from humans to agents under constraints, as current systems remain limited to bounded data-plane autonomy backed by external governance.
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
cs.SE 2026-04 unverdicted novelty 5.0

LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
Code Semantic Zooming
cs.HC 2025-10 unverdicted novelty 5.0

CodeZoom is a pseudocode-based multi-layer abstraction tool that improves developer control and comprehension over LLM code generation compared to direct use of agents like Claude Code.
An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
cs.SE 2026-04 unverdicted novelty 4.0

Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
cs.SE 2026-02 unverdicted novelty 3.0

A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...

Reference graph

Works this paper leans on

290 extracted references · 290 canonical work pages · cited by 19 Pith papers · 14 internal anchors


work page 2002
[56]

Marcos, and J

Alejandro Rago, Claudia A. Marcos, and J. Andr ´es D ´ıaz Pace. Uncovering quality-attribute concerns in use case specifications via early aspect mining.Requir. Eng., 18(1):67–84, 2013

work page 2013
[57]

The applications of natural language processing (NLP) for software requirement engineering - A systematic literature review

Farhana Nazir, Wasi Haider Butt, Muhammad Waseem Anwar, and Muazzam Ali Khan Khattak. The applications of natural language processing (NLP) for software requirement engineering - A systematic literature review. In Kuinam Kim and Nikolai Joukov, editors,Information Science and Applications 2017 - ICISA 2017, Macau, China, 20-23 March 2017, volume 424 ofLec...

work page 2017
[58]

Automatically classifying user requests in crowdsourcing requirements engineering.J

Chuanyi Li, Liguo Huang, Jidong Ge, Bin Luo, and Vincent Ng. Automatically classifying user requests in crowdsourcing requirements engineering.J. Syst. Softw., 138:108–123, 2018

work page 2018
[59]

Advances in au- tomated support for requirements engineering: A systematic literature review.Requir

Muhammad Aminu Umar and Kevin Lano. Advances in au- tomated support for requirements engineering: A systematic literature review.Requir. Eng., 29(2):177–207, 2024

work page 2024
[60]

PRCBERT: prompt learning for requirement classification using bert-based pretrained language models

Xianchang Luo, Yinxing Xue, Zhenchang Xing, and Jiamou Sun. PRCBERT: prompt learning for requirement classification using bert-based pretrained language models. In37th IEEE/ACM Inter- national Conference on Automated Software Engineering, ASE 2022, Rochester, MI, USA, October 10-14, 2022, pages 75:1–75:13. ACM, 2022

work page 2022
[61]

Using llms in software requirements specifications: An empirical evaluation

Madhava Krishna, Bhagesh Gaur, Arsh Verma, and Pankaj Jalote. Using llms in software requirements specifications: An empirical evaluation. In Grischa Liebel, Irit Hadar, and Paola Spoletini, ed- itors,32nd IEEE International Requirements Engineering Conference, RE 2024, Reykjavik, Iceland, June 24-28, 2024, pages 475–483. IEEE, 2024

work page 2024
[62]

Empirical evaluation of chatgpt on requirements information retrieval under zero-shot setting

Jianzhang Zhang, Yiyang Chen, Chuang Liu, Nan Niu, and Yinglin Wang. Empirical evaluation of chatgpt on requirements information retrieval under zero-shot setting. In2023 Inter- national Conference on Intelligent Computing and Next Generation Networks (ICNGN), pages 1–6. IEEE, 2023

work page 2023
[63]

Krishna Ronanki, Beatriz Cabrero Daniel, and Christian Berger. Chatgpt as a tool for user story quality evaluation: Trustworthy out of the box? In Philippe Kruchten and Peggy Gregory, editors, Agile Processes in Software Engineering and Extreme Programming - Workshops - XP 2022 Workshops, Copenhagen, Denmark, June 13-17, 2022, and XP 2023 Workshops, Amste...

work page 2022
[64]

Improving requirements completeness: Automated assistance through large language models.Requir

Dipeeka Luitel, Shabnam Hassani, and Mehrdad Sabetzadeh. Improving requirements completeness: Automated assistance through large language models.Requir. Eng., 29(1):73–95, 2024

work page 2024
[65]

Mohammadmehdi Ataei, Hyunmin Cheong, Daniele Grandi, Ye Wang, Nigel Morris, and Alexander Tessier. Elicitron: A large language model agent-based simulation framework for design requirements elicitation.Journal of Computing and Information Science in Engineering, 25(2):021012, 01 2025

work page 2025
[66]

Specgen: Automated generation of formal program specifications via large language models

Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. Specgen: Automated generation of formal program specifications via large language models. In47th IEEE/ACM International Conference on Software Engineering, ICSE 2025, Ottawa, ON, Canada, April 26 - May 6, 2025, pages 16–28. IEEE, 2025

work page 2025
[67]

Springer Nature Switzerland, Cham, 2024

Chetan Arora, John Grundy, and Mohamed Abdelrazek.Advanc- ing Requirements Engineering Through Generative AI: Assessing the Role of LLMs, pages 129–148. Springer Nature Switzerland, Cham, 2024

work page 2024
[68]

Mare: Multi-agents col- laboration framework for requirements engineering,

Dongming Jin, Zhi Jin, Xiaohong Chen, and Chunhui Wang. MARE: multi-agents collaboration framework for requirements engineering.CoRR, abs/2405.03256, 2024

work page arXiv 2024
[69]

David R. Cok. Openjml: JML for java 7 by extending openjdk. In Mihaela Gheorghiu Bobaru, Klaus Havelund, Gerard J. Holz- mann, and Rajeev Joshi, editors,NASA Formal Methods - Third International Symposium, NFM 2011, Pasadena, CA, USA, April 18- 20, 2011. Proceedings, volume 6617 ofLecture Notes in Computer Science, pages 472–479. Springer, 2011

work page 2011
[70]

Rustan M

Cormac Flanagan and K. Rustan M. Leino. Houdini, an anno- tation assistant for esc/java. In Jos ´e Nuno Oliveira and Pamela Zave, editors,FME 2001: Formal Methods for Increasing Software Productivity, International Symposium of Formal Methods Europe, Berlin, Germany, March 12-16, 2001, Proceedings, volume 2021 of Lecture Notes in Computer Science, pages 5...

work page 2001
[71]

Ernst, Jeff H

Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, Stephen McCa- mant, Carlos Pacheco, Matthew S. Tschantz, and Chen Xiao. The daikon system for dynamic detection of likely invariants.Sci. Comput. Program., 69(1-3):35–45, 2007

work page 2007
[72]

Zhang, Yang Liu, and Yun Ma

Yaoqi Guo, Zhenpeng Chen, Jie M. Zhang, Yang Liu, and Yun Ma. Personality-guided code generation using large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1068– 1080, Vienna, Austria, July 2025. Association for Computational Linguistics

work page 2025
[73]

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the AI ocean: A survey on hallucination in large language models.CoRR, abs/2309.01219, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[74]

Reflexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing System...

work page 2023
[75]

Fully autonomous programming with large language models

Vadim Liventsev, Anastasiia Grishina, Aki H ¨arm¨a, and Leon Moonen. Fully autonomous programming with large language models. In Sara Silva and Lu ´ıs Paquete, editors,Proceedings of the Genetic and Evolutionary Computation Conference, GECCO 2023, Lisbon, Portugal, July 15-19, 2023, pages 1146–1155. ACM, 2023

work page 2023
[76]

Olausson, Jeevana Priya Inala, Chenglong Wang, Jian- feng Gao, and Armando Solar-Lezama

Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jian- feng Gao, and Armando Solar-Lezama. Is self-repair a silver bullet for code generation? InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

work page 2024
[78]

Autogen: Enabling next-gen LLM applications via multi- agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen LLM applications via multi- agent conversations. InFirst Conference on Language Modeling, 2024

work page 2024
[79]

INTERVENOR: prompting the coding ability of large language models with the interactive chain of repair

Hanbin Wang, Zhenghao Liu, Shuo Wang, Ganqu Cui, Ning Ding, Zhiyuan Liu, and Ge Yu. INTERVENOR: prompting the coding ability of large language models with the interactive chain of repair. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, SEPTEMBER 2024 50 editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and vi...

work page 2024
[80]

Test-driven development and llm-based code generation

Noble Saji Mathews and Meiyappan Nagappan. Test-driven development and llm-based code generation. In Vladimir Filkov, Baishakhi Ray, and Minghui Zhou, editors,Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engi- neering, ASE 2024, Sacramento, CA, USA, October 27 - November 1, 2024, pages 1583–1594. ACM, 2024

work page 2024
[81]

Autocoder: Enhancing code large language model with aiev-instruct.CoRR, abs/2405.14906, 2024

Bin Lei, Yuchen Li, and Qiuwu Chen. Autocoder: Enhancing code large language model with aiev-instruct.CoRR, abs/2405.14906, 2024

work page arXiv 2024
[82]

Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules

Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, and Shafiq Joty. Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

work page 2024

Showing first 80 references.