From bugs to benchmarks: A comprehensive survey of software defect datasets

Hao-Nan Zhu, Robert M · 2025 · arXiv 2504.17977

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks

cs.SE · 2025-10-20 · unverdicted · novelty 7.0

JunoBench is the first benchmark of 111 reproducible crashes in Python ML Jupyter notebooks from Kaggle, with verified fixes and rich annotations for bug research.

What Makes Software Bugs Escape Testing? Evidence from a Large-Scale Empirical Study

cs.SE · 2026-04-29 · unverdicted · novelty 6.0

Post-release defects concentrate in older, frequently modified high-churn components and require longer and more complex fixes than pre-release defects.

Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset

cs.SE · 2026-04-29 · unverdicted · novelty 5.0

21.6% of Defects4J defects are unsuitable and 7.1% have under-specified test suites for reproducible APR evaluation.

citing papers explorer

Showing 2 of 2 citing papers after filters.

What Makes Software Bugs Escape Testing? Evidence from a Large-Scale Empirical Study

cs.SE · 2026-04-29 · unverdicted · none · ref 6

Post-release defects concentrate in older, frequently modified high-churn components and require longer and more complex fixes than pre-release defects.

Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset

cs.SE · 2026-04-29 · unverdicted · none · ref 2

21.6% of Defects4J defects are unsuitable and 7.1% have under-specified test suites for reproducible APR evaluation.