{"paper":{"title":"PBT-Bench: Benchmarking AI Agents on Property-Based Testing","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A new benchmark shows LLMs using Hypothesis scaffolding can recall 42 to 83 percent of injected semantic bugs by turning library documentation into precise property-based tests.","cross_cats":["cs.AI"],"primary_cat":"cs.SE","authors_text":"Liao Zhang, Lucas Jing, Simon S. Du, Xinqi Wang","submitted_at":"2026-05-13T18:01:05Z","abstract_excerpt":"Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a\n  patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant\n  from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation.\n  We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem\n  injects one or more semantic bugs (365 in total, mean 3.65 per problem) de"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The 365 injected bugs and the three difficulty strata (L1-L3) are representative of real semantic bugs that would matter in production Python libraries, and that the curation process did not inadvertently favor bugs that current LLMs happen to be good or bad at detecting.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A new benchmark shows LLMs using Hypothesis scaffolding can recall 42 to 83 percent of injected semantic bugs by turning library documentation into precise property-based tests.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a44f52273a1a7d4a2d92c9c7a315cdf2ecdd28577fa680be78cfd2d0dfad94d7"},"source":{"id":"2605.15229","kind":"arxiv","version":1},"verdict":{"id":"a4e3f6fd-b5b5-4cd9-a0fe-9c4919b32fa4","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T17:19:53.915761Z","strongest_claim":"Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation.","one_line_summary":"PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The 365 injected bugs and the three difficulty strata (L1-L3) are representative of real semantic bugs that would matter in production Python libraries, and that the curation process did not inadvertently favor bugs that current LLMs happen to be good or bad at detecting.","pith_extraction_headline":"A new benchmark shows LLMs using Hypothesis scaffolding can recall 42 to 83 percent of injected semantic bugs by turning library documentation into precise property-based tests."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.15229/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"claim_evidence","ran_at":"2026-05-19T18:01:56.058708Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T17:31:18.489718Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T17:26:27.336623Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T13:33:22.830811Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"6b3291bbc3c4ac20153f7f00e8d74d52624da6c1011ab732c33642d044593ef6"},"references":{"count":17,"sample":[{"doi":"","year":null,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":1,"cited_arxiv_id":"2107.03374","is_internal_anchor":true},{"doi":"10.1145/3663529","year":null,"title":"doi: 10.1145/3663529. 3663801. Jason Chou, Ao Liu, Yuchi Deng, et al. AutoCodeBench: Large language models are automatic code benchmark generators,","work_id":"2f480db9-fa3d-4d36-9f87-a720a63f81ff","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"2508.09101 , archivePrefix =","work_id":"d89726ee-893f-4853-8207-bbed25feabc1","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3597926.3598067","year":null,"title":"Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models","work_id":"502d0ae7-5fff-4775-bb14-80113ea8826f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3597503.3623343","year":null,"title":"Meyer, and Thomas Fritz","work_id":"50791d64-17f8-4df6-ae45-662b6a9cf83a","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":17,"snapshot_sha256":"58d63ed17bb999521e00c158b363145948a8ee4508b5903b16321737e1a3c3aa","internal_anchors":3},"formal_canon":{"evidence_count":1,"snapshot_sha256":"ce82281ccfe7cee54280215db6a3dfca699e7a3beed5ad4e0a700cb37adb30c0"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}