pith:MQ7SPMCD
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
A new benchmark shows that LLMs given Hypothesis scaffolding detect 42 to 83 percent of injected semantic bugs by turning library documentation into precise property-based tests.
arxiv:2605.15229 v1 · 2026-05-13 · cs.SE · cs.AI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{MQ7SPMCD6656MLH2JYE2DVOWIH}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation.
The 365 injected bugs and the three difficulty strata (L1-L3) are representative of real semantic bugs that would matter in production Python libraries, and the curation process did not inadvertently favor bugs that current LLMs happen to be good or bad at detecting.
PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.
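As a rough illustration of what such a task involves, the sketch below derives one invariant from a documented behaviour, builds an input generator aimed at the boundary case, and uses it to expose an injected semantic bug. Everything in it is hypothetical and not taken from the benchmark: the `merge_intervals_*` functions, the injected bug, and the generator are invented for this example, and a seeded standard-library `random` loop stands in for Hypothesis so that the block has no third-party dependencies.

```python
import random


def merge_intervals_reference(intervals):
    """Hypothetical documented behaviour: merge overlapping or touching closed intervals."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def merge_intervals_buggy(intervals):
    """Same function with an injected semantic bug: touching intervals are left unmerged."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo < merged[-1][1]:  # bug: should be <=
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def gen_intervals(rng, max_len=6, grid=8):
    """Targeted generator: a small integer grid makes shared endpoints likely."""
    out = []
    for _ in range(rng.randint(0, max_len)):
        lo = rng.randint(0, grid)
        hi = rng.randint(lo, grid)
        out.append((lo, hi))
    return out


def output_is_strictly_separated(merged):
    """Invariant read off the docs: consecutive output intervals never overlap or touch."""
    return all(prev_hi < next_lo
               for (_, prev_hi), (next_lo, _) in zip(merged, merged[1:]))


def find_counterexample(fn, trials=500, seed=0):
    """Return the first generated input that violates the invariant, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        intervals = gen_intervals(rng)
        if not output_is_strictly_separated(fn(intervals)):
            return intervals
    return None
```

A generator drawing endpoints from a wide range would rarely produce two intervals sharing an endpoint, so the bug could go unnoticed; restricting the grid is one way to make the generator "targeted" in the sense the claim describes. With Hypothesis itself, the loop would be replaced by a `@given`-decorated test over an equivalent strategy.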
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-20T00:00:47.409729Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
643f27b043f7bbe62cfa4e09a1d5d641d7a618f996dace6e79adef335de56b9a
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/MQ7SPMCD6656MLH2JYE2DVOWIH \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 643f27b043f7bbe62cfa4e09a1d5d641d7a618f996dace6e79adef335de56b9a
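The hashing step of the pipeline above can also be written as a small Python module, which may be easier to reuse or inspect than the one-liner. This is a minimal sketch that mirrors the `json.dumps` arguments shown in the command; the function names `canonical_bytes` and `canonical_hash` are chosen here for illustration and are not part of any Pith tooling.

```python
import hashlib
import json


def canonical_bytes(record):
    """Serialize a record the same way as the one-liner above:
    sorted keys, no whitespace between tokens, non-ASCII kept as-is, UTF-8 encoded."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode()


def canonical_hash(record):
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()
```

Because keys are sorted and whitespace is stripped before hashing, the digest does not depend on the key order or indentation of the JSON as displayed. To verify a record, load the `canonical_record` object with `json.loads` and compare `canonical_hash(...)` against the canonical hash listed on this page.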
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "29cb9b7070890324b6fef319ba769377b23bcf17188322794f58b62876b0df22",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.SE",
    "submitted_at": "2026-05-13T18:01:05Z",
    "title_canon_sha256": "c2ae6a7478549fdba58e03d4387701cb6e72cd4752bab4a6e85fc51c8f23d037"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15229",
    "kind": "arxiv",
    "version": 1
  }
}