Ernst, Reid Holmes, and Gordon Fraser

doi: 10 · 2014 · arXiv 0384.262805

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

BioDefect: The First Dataset for Defect Detection in Bioinformatics Software

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

cs.SE · 2026-05-13 · unverdicted · novelty 7.0 · 3 refs

PBT-Bench is a new benchmark with 100 property-based testing problems across 40 Python libraries that measures LLM bug recall rates of 42.1-83.4% under guided prompting versus 31.4-76.7% in baseline.

CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation

cs.SE · 2026-04-21 · unverdicted · novelty 7.0

CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

cs.SE · 2025-11-02 · unverdicted · novelty 7.0

Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.

Multi-task LLMs for Bug Classification: Efficient Inference with Auxiliary Decoding Heads

cs.SE · 2026-06-08 · unverdicted · novelty 6.0

A multi-task LLM approach for efficient line-level bug localization using auxiliary decoding heads and token alignment.

Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization

cs.SE · 2026-05-08 · unverdicted · novelty 6.0

SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.

Reproduction Test Generation for Java SWE Issues

cs.SE · 2026-05-05 · unverdicted · novelty 6.0

Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.

From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

cs.SE · 2026-04-20 · conditional · novelty 6.0 · 2 refs

Targeted, evidence-rich context partitions improve causal clarity and actionability of LLM failure explanations while large undifferentiated contexts produce vaguer outputs, with higher-quality explanations correlating to better downstream repair rates.

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

cs.SE · 2026-04-13 · conditional · novelty 6.0

AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.

PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair

cs.SE · 2026-04-03 · unverdicted · novelty 6.0

PAFT improves LLM-based program repair pass rates by up to 65.6% while cutting average edit distance by up to 32.6% through explicit preservation signals and curriculum training.

MutDafny: A Mutation-Based Approach to Assess Dafny Specifications

cs.SE · 2025-11-19 · conditional · novelty 6.0

MutDafny uses 40 mutation operators on 794 real-world Dafny programs to detect weak specifications, manually confirming five such cases at a rate of one per 241 lines.

All Green, Still Broken: Real-Flow Verification Lessons from an LLM-Integrated, Multi-Market Web Application

cs.SE · 2026-06-21 · unverdicted · novelty 5.0

Analysis of 252 bug fixes in an LLM-powered multi-market web app found 44% escaped through four seams invisible to component unit tests, motivating a four-seam verification framework.

LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs

cs.SE · 2026-06-07 · unverdicted · novelty 5.0

LLM-generated unit tests with retrieval-augmented context detect faults in 69% of real Python bugs versus 17.2% for general-purpose human-written tests, with similar coverage levels.

PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction

cs.SE · 2026-05-21 · conditional · novelty 5.0

PITMuS automates source-level bug dataset generation by mapping PIT bytecode mutants back to Java source using debug information, producing structured pairs and metadata evaluated on eight open-source systems.

BugForge: Constructing and Utilizing DBMS Bug Repository to Enhance DBMS Testing

cs.SE · 2026-04-03 · unverdicted · novelty 5.0

BugForge constructs a unified DBMS bug repository from 37,632 reports spanning 28 years and uses it to generate test cases that uncovered 35 previously unknown bugs across PostgreSQL, MySQL, MariaDB, and MonetDB.

LLM-Based Automated Diagnosis Of Integration Test Failures At Google

cs.SE · 2026-04-13 · unverdicted · novelty 4.0

Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.

citing papers explorer

Showing 14 of 14 citing papers after filters.

BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
cs.SE · 2026-05-20 · unverdicted · none · ref 21

PBT-Bench: Benchmarking AI Agents on Property-Based Testing
cs.SE · 2026-05-13 · unverdicted · none · ref 7 · 3 links

CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation
cs.SE · 2026-04-21 · unverdicted · none · ref 15

Multi-task LLMs for Bug Classification: Efficient Inference with Auxiliary Decoding Heads
cs.SE · 2026-06-08 · unverdicted · none · ref 10

Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
cs.SE · 2026-05-08 · unverdicted · none · ref 27

Reproduction Test Generation for Java SWE Issues
cs.SE · 2026-05-05 · unverdicted · none · ref 17

From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge
cs.SE · 2026-04-20 · conditional · none · ref 11 · 2 links

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection
cs.SE · 2026-04-13 · conditional · none · ref 26

PAFT: Preservation Aware Fine-Tuning for Minimal-Edit Program Repair
cs.SE · 2026-04-03 · unverdicted · none · ref 17

All Green, Still Broken: Real-Flow Verification Lessons from an LLM-Integrated, Multi-Market Web Application
cs.SE · 2026-06-21 · unverdicted · none · ref 10

LLM vs. Human Unit Tests: Fault Detection on Real Python Bugs
cs.SE · 2026-06-07 · unverdicted · none · ref 13

PITMuS: A Tool for Automated Bug Dataset Generation via Source-Level Mutant Reconstruction
cs.SE · 2026-05-21 · conditional · none · ref 8

BugForge: Constructing and Utilizing DBMS Bug Repository to Enhance DBMS Testing
cs.SE · 2026-04-03 · unverdicted · none · ref 7

LLM-Based Automated Diagnosis Of Integration Test Failures At Google
cs.SE · 2026-04-13 · unverdicted · none · ref 19
