Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 33(8):1–79, 2024.
3 Pith papers cite this work.
Citing papers by year: 2026 (3)
citing papers explorer
- Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation
  R2ABench benchmark shows LLMs generate syntactically valid software architectures from requirements but produce structurally fragmented results due to weak relational reasoning.
- Revisiting DAgger in the Era of LLM-Agents
  DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
- Prompt-Driven Code Summarization: A Systematic Literature Review
  A systematic review that categorizes prompting strategies for LLM-based code summarization, assesses their effectiveness, and identifies gaps in research and evaluation practices.