BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
hub
Ernst, Reid Holmes, and Gordon Fraser
16 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
fields
cs.SE 16roles
background 4polarities
background 4representative citing papers
PBT-Bench is a new benchmark with 100 property-based testing problems across 40 Python libraries that measures LLM bug recall rates of 42.1-83.4% under guided prompting versus 31.4-76.7% in baseline.
CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.
Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
A multi-task LLM approach for efficient line-level bug localization using auxiliary decoding heads and token alignment.
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
Targeted, evidence-rich context partitions improve causal clarity and actionability of LLM failure explanations while large undifferentiated contexts produce vaguer outputs, with higher-quality explanations correlating to better downstream repair rates.
AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.
PAFT improves LLM-based program repair pass rates by up to 65.6% while cutting average edit distance by up to 32.6% through explicit preservation signals and curriculum training.
MutDafny uses 40 mutation operators on 794 real-world Dafny programs to detect weak specifications, manually confirming five such cases at a rate of one per 241 lines.
Analysis of 252 bug fixes in an LLM-powered multi-market web app found 44% escaped through four seams invisible to component unit tests, motivating a four-seam verification framework.
LLM-generated unit tests with retrieval-augmented context detect faults in 69% of real Python bugs versus 17.2% for general-purpose human-written tests, with similar coverage levels.
PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.
BugForge constructs a unified DBMS bug repository from 37,632 reports spanning 28 years and uses it to generate test cases that uncovered 35 previously unknown bugs across PostgreSQL, MySQL, MariaDB, and MonetDB.
Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.
citing papers explorer
-
BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
-
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
PBT-Bench is a new benchmark with 100 property-based testing problems across 40 Python libraries that measures LLM bug recall rates of 42.1-83.4% under guided prompting versus 31.4-76.7% in baseline.
-
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation
CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.
-
Multi-task LLMs for Bug Classification: Efficient Inference with Auxiliary Decoding Heads
A multi-task LLM approach for efficient line-level bug localization using auxiliary decoding heads and token alignment.
-
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
-
Reproduction Test Generation for Java SWE Issues
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
-
From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge
Targeted, evidence-rich context partitions improve causal clarity and actionability of LLM failure explanations while large undifferentiated contexts produce vaguer outputs, with higher-quality explanations correlating to better downstream repair rates.
-
AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection
AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.
-
PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair
PAFT improves LLM-based program repair pass rates by up to 65.6% while cutting average edit distance by up to 32.6% through explicit preservation signals and curriculum training.
-
All Green, Still Broken: Real-Flow Verification Lessons from an LLM-Integrated, Multi-Market Web Application
Analysis of 252 bug fixes in an LLM-powered multi-market web app found 44% escaped through four seams invisible to component unit tests, motivating a four-seam verification framework.
-
LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs
LLM-generated unit tests with retrieval-augmented context detect faults in 69% of real Python bugs versus 17.2% for general-purpose human-written tests, with similar coverage levels.
-
PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction
PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.
-
BugForge: Constructing and Utilizing DBMS Bug Repository to Enhance DBMS Testing
BugForge constructs a unified DBMS bug repository from 37,632 reports spanning 28 years and uses it to generate test cases that uncovered 35 previously unknown bugs across PostgreSQL, MySQL, MariaDB, and MonetDB.
-
LLM-Based Automated Diagnosis Of Integration Test Failures At Google
Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.