PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.SE 3verdicts
UNVERDICTED 3roles
method 1polarities
use method 1representative citing papers
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
MultiMend augments buggy function context via retrieval and generates multi-hunk patches, fixing 2,227 of 5,501 bugs across six benchmarks in four languages.
citing papers explorer
-
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.
-
Similar Pattern Annotation via Retrieval Knowledge for LLM-Based Test Code Fault Localization
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
-
MultiMend: Multilingual Program Repair with Context Augmentation and Multi-Hunk Patch Generation
MultiMend augments buggy function context via retrieval and generates multi-hunk patches, fixing 2,227 of 5,501 bugs across six benchmarks in four languages.