Sakura is a multi-agent system that generates structurally complex tests from NL descriptions, achieving 50-78% higher compilability and 38-66% higher coverage overlap than baselines on 1,464 scenarios from 20 Apache Commons applications.
TDD-Bench verified: Can LLMs generate tests for issues before they get resolved?arXiv preprint arXiv:2412.02883
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.SE 9representative citing papers
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
TestPrune minimizes regression test suites to improve bug reproduction and patch validation in LLM-based agentic repair pipelines, delivering 6-13% relative gains on SWE-Bench benchmarks at low API cost.
citing papers explorer
-
Investigating Test Overfitting on SWE-bench
The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.
-
Can Old Tests Do New Tricks for Resolving SWE Issues?
TestPrune minimizes regression test suites to improve bug reproduction and patch validation in LLM-based agentic repair pipelines, delivering 6-13% relative gains on SWE-Bench benchmarks at low API cost.