hub

Defects4j: a database of existing faults to enable controlled testing studies for java programs

René Just, Darioush Jalali, Michael D · 2014 · arXiv 0384.262805

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

BioDefect: The First Dataset for Defect Detection in Bioinformatics Software

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.

CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation

cs.SE · 2026-04-21 · unverdicted · novelty 7.0

CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

cs.SE · 2025-11-02 · unverdicted · novelty 7.0

Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.

cs.SE · 2026-05-08 · unverdicted · novelty 6.0

SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.

Reproduction Test Generation for Java SWE Issues

cs.SE · 2026-05-05 · unverdicted · novelty 6.0

Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.

From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

cs.SE · 2026-04-20 · conditional · novelty 6.0 · 2 refs

Targeted, evidence-rich context partitions improve causal clarity and actionability of LLM failure explanations while large undifferentiated contexts produce vaguer outputs, with higher-quality explanations correlating to better downstream repair rates.

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

cs.SE · 2026-04-13 · conditional · novelty 6.0

AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.

PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair

cs.SE · 2026-04-03 · unverdicted · novelty 6.0

PAFT improves LLM-based program repair pass rates by up to 65.6% while cutting average edit distance by up to 32.6% through explicit preservation signals and curriculum training.

MutDafny: A Mutation-Based Approach to Assess Dafny Specifications

cs.SE · 2025-11-19 · conditional · novelty 6.0

MutDafny uses 40 mutation operators on 794 real-world Dafny programs to detect weak specifications, manually confirming five such cases at a rate of one per 241 lines.

PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction

cs.SE · 2026-05-21 · conditional · novelty 5.0

PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.

BugForge: Constructing and Utilizing DBMS Bug Repository to Enhance DBMS Testing

cs.SE · 2026-04-03 · unverdicted · novelty 5.0

BugForge constructs a unified DBMS bug repository from 37,632 reports spanning 28 years and uses it to generate test cases that uncovered 35 previously unknown bugs across PostgreSQL, MySQL, MariaDB, and MonetDB.

LLM-Based Automated Diagnosis Of Integration Test Failures At Google

cs.SE · 2026-04-13 · unverdicted · novelty 4.0

Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

cs.SE · 2026-05-13 · 2 refs

citing papers explorer

Showing 13 of 13 citing papers.

BioDefect: The First Dataset for Defect Detection in Bioinformatics Software cs.SE · 2026-05-20 · unverdicted · none · ref 21
BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation cs.SE · 2026-04-21 · unverdicted · none · ref 15
CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.
Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems cs.SE · 2025-11-02 · unverdicted · none · ref 30
Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization cs.SE · 2026-05-08 · unverdicted · none · ref 27
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
Reproduction Test Generation for Java SWE Issues cs.SE · 2026-05-05 · unverdicted · none · ref 17
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge cs.SE · 2026-04-20 · conditional · none · ref 11 · 2 links
Targeted, evidence-rich context partitions improve causal clarity and actionability of LLM failure explanations while large undifferentiated contexts produce vaguer outputs, with higher-quality explanations correlating to better downstream repair rates.
AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection cs.SE · 2026-04-13 · conditional · none · ref 26
AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.
PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair cs.SE · 2026-04-03 · unverdicted · none · ref 17
PAFT improves LLM-based program repair pass rates by up to 65.6% while cutting average edit distance by up to 32.6% through explicit preservation signals and curriculum training.
MutDafny: A Mutation-Based Approach to Assess Dafny Specifications cs.SE · 2025-11-19 · conditional · none · ref 31
MutDafny uses 40 mutation operators on 794 real-world Dafny programs to detect weak specifications, manually confirming five such cases at a rate of one per 241 lines.
PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction cs.SE · 2026-05-21 · conditional · none · ref 8
PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.
BugForge: Constructing and Utilizing DBMS Bug Repository to Enhance DBMS Testing cs.SE · 2026-04-03 · unverdicted · none · ref 7
BugForge constructs a unified DBMS bug repository from 37,632 reports spanning 28 years and uses it to generate test cases that uncovered 35 previously unknown bugs across PostgreSQL, MySQL, MariaDB, and MonetDB.
LLM-Based Automated Diagnosis Of Integration Test Failures At Google cs.SE · 2026-04-13 · unverdicted · none · ref 19
Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.
PBT-Bench: Benchmarking AI Agents on Property-Based Testing cs.SE · 2026-05-13 · unreviewed · ref 7 · 2 links

Defects4j: a database of existing faults to enable controlled testing studies for java programs

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer