PBT-Bench is a new benchmark with 100 property-based testing problems across 40 Python libraries that measures LLM bug recall rates of 42.1-83.4% under guided prompting versus 31.4-76.7% in baseline.
Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages =
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 7roles
method 1polarities
use method 1representative citing papers
UniCoder applies symbolic attribute alignment via an auxiliary LLM and reference-guided optimization in RL to achieve SOTA visual-to-code generation on ChartMimic, UniSVG, Design2Code, and ScreenBench.
CAPED reduces incidental visual privacy leakage in mobile GUI agents from 0.766 to 0.268 on seeded AndroidWorld tasks by selectively exposing only task-relevant screen content.
SPARK improves LLM-based test code fault localization by retrieving similar past faults and selectively annotating suspicious lines in new failing tests.
Analysis of 252 bug fixes in an LLM-powered multi-market web app found 44% escaped through four seams invisible to component unit tests, motivating a four-seam verification framework.
LLM-generated unit tests with retrieval-augmented context detect faults in 69% of real Python bugs versus 17.2% for general-purpose human-written tests, with similar coverage levels.
MultiMend augments buggy function context via retrieval and generates multi-hunk patches, fixing 2,227 of 5,501 bugs across six benchmarks in four languages.
citing papers explorer
No citing papers match the current filters.