An empirical study on the code refactoring capability of large language models
7 Pith papers cite this work.
Fields: cs.SE (7)
Roles: background (2)
Citing papers
- SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair
  SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.
- Patterns of Developer Adoption of LLM-Generated Code Refactoring Suggestions
  Analysis of GitHub commits shows developers mostly accept LLM refactoring suggestions without changes, with modifications clustering into five patterns based on activity, prompt, and response validity.
- Foundation Models as Oracles for Refactoring Correctness Detection
  Foundation models serve as effective oracles for detecting refactoring correctness issues in Java programs, achieving up to 93.8% accuracy in zero-shot evaluations on 226 real bugs.
- Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
  CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.
- Your Build Scripts Stink: The State of Code Smells in Build Scripts
  The study identifies 13 categories of code smells in build scripts, detects 10,895 occurrences across 5,882 scripts from 4,877 repositories, and finds common patterns like insecure URLs in Maven and hardcoded paths in Gradle and CMake.
- AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development
  More capable LLMs and agents generate code with greater volume and architectural decay, following a Volume-Quality Inverse Law that neither functional correctness nor prompting mitigates.
- A Blueprint for AI-Driven Software Quality: Integrating LLMs with Established Standards
  Survey mapping LLM applications in software quality assurance to established standards including ISO/IEC 12207, ISO 25010, CMMI, and TMM, with case studies, challenges, and future directions.