LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
Beyond correctness: Benchmarking multi-dimensional code generation for large language models
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2representative citing papers
MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.
LLMs frequently reverse their stated coding preferences when shown actual code instead of descriptions, show positional bias, and produce more polarized ratings than human experts on complexity, commenting, modularity, and readability.
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
LLM-generated unit tests with retrieval-augmented context detect faults in 69% of real Python bugs versus 17.2% for general-purpose human-written tests, with similar coverage levels.
LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.
citing papers explorer
-
Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software
LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
-
MetaLint: Easy-to-Hard Generalization for Code Linting
MetaLint uses meta-learning to let models generalize from easy synthetic linting data to hard human-curated best practices, yielding large F-score gains on a new PEP-inspired benchmark.
-
Subjective Code Preferences in Experts and Large Language Models
LLMs frequently reverse their stated coding preferences when shown actual code instead of descriptions, show positional bias, and produce more polarized ratings than human experts on complexity, commenting, modularity, and readability.
-
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
-
LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs
LLM-generated unit tests with retrieval-augmented context detect faults in 69% of real Python bugs versus 17.2% for general-purpose human-written tests, with similar coverage levels.
-
"Like Taking the Path of Least Resistance": Exploring the Impact of LLM Interaction on the Creative Process of Programming
LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.