Refactoring Runaway

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, Abhik Roychoudhury · 2024 · arXiv 0212.36803

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Investigating Test Overfitting on SWE-bench

cs.SE · 2025-11-20 · unverdicted · novelty 7.0

The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.

Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

cs.SE · 2025-09-22 · unverdicted · novelty 7.0

Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to proprietary models.

"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution

cs.SE · 2026-05-21 · unverdicted · novelty 5.0

Empirical study finds coding agents produce fewer and less intense tangled refactorings than humans on Multi-SWE-bench; a refactoring-aware refinement improves compilability from 19.34% to 38.33% and resolves 2.79% more issues.

citing papers explorer

Investigating Test Overfitting on SWE-bench · cs.SE · 2025-11-20 · unverdicted · none · ref 23
Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs · cs.SE · 2025-09-22 · unverdicted · none · ref 45
"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution · cs.SE · 2026-05-21 · unverdicted · none · ref 56