Pith Number

pith:MQ7SPMCD

pith:2026:MQ7SPMCD6656MLH2JYE2DVOWIH
not attested · not anchored · not stored · refs resolved

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Liao Zhang, Lucas Jing, Simon S. Du, Xinqi Wang

A new benchmark shows that LLMs given Hypothesis scaffolding achieve 42 to 83 percent recall on injected semantic bugs by turning library documentation into precise property-based tests.

arxiv:2605.15229 v1 · 2026-05-13 · cs.SE · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{MQ7SPMCD6656MLH2JYE2DVOWIH}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
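The record does not describe the merge algorithm itself. As an illustration of what "deterministic" means here, the sketch below folds a set of events into a state in a way that does not depend on the order in which a mirror received them. The `Event` fields, the last-writer-wins rule, and the `(timestamp, event_id)` ordering are all hypothetical choices made for this example, not Pith's actual design.

```python
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    # All four fields are hypothetical; the real event schema is not given in the record.
    event_id: str   # assumed globally unique
    timestamp: str  # ISO 8601 UTC in one fixed format, so string order equals time order
    field: str
    value: str


def merge(canonical: dict, events: Iterable[Event]) -> dict:
    """Fold events into a copy of the canonical record, independent of arrival order."""
    state = dict(canonical)
    # Drop duplicate deliveries; events sharing an id are assumed to be identical.
    unique = {e.event_id: e for e in events}
    # A total order on (timestamp, event_id) makes the fold reproducible on every mirror.
    for e in sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id)):
        state[e.field] = e.value  # last writer wins (an assumption of this sketch)
    return state


events = [
    Event("e1", "2026-05-20T00:00:47Z", "attested", "no"),
    Event("e2", "2026-05-21T09:30:00Z", "attested", "yes"),
    Event("e3", "2026-05-21T09:30:00Z", "stored", "yes"),
]
state_a = merge({"id": "MQ7SPMCD"}, events)
state_b = merge({"id": "MQ7SPMCD"}, list(reversed(events)) + [events[0]])
print(state_a == state_b)
```

Because the events are deduplicated and totally ordered before being applied, two mirrors that hold the same bundle arrive at the same state even if they received the events in different orders or more than once.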

Claims

C1 · strongest claim

Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation.
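The claim mixes recall rates with gains stated in percentage points, which are easy to confuse with relative percent changes. The sketch below shows the arithmetic, assuming recall is simply detected bugs divided by total injected bugs (the record does not spell out the exact definition). The detected counts are hypothetical and are not taken from the paper.

```python
def recall(detected: int, total: int) -> float:
    """Fraction of injected bugs exposed by at least one generated test."""
    if total <= 0:
        raise ValueError("total must be positive")
    return detected / total


def lift_pp(guided: float, baseline: float) -> float:
    """Absolute difference between two recall rates, in percentage points."""
    return (guided - baseline) * 100.0


TOTAL_BUGS = 365         # benchmark size stated in the record
baseline_detected = 150  # hypothetical count, for illustration only
guided_detected = 230    # hypothetical count, for illustration only

baseline_recall = recall(baseline_detected, TOTAL_BUGS)
guided_recall = recall(guided_detected, TOTAL_BUGS)
lift = lift_pp(guided_recall, baseline_recall)
relative_gain = (guided_recall / baseline_recall - 1.0) * 100.0

print(f"baseline {baseline_recall:.1%}, guided {guided_recall:.1%}")
print(f"lift {lift:.1f} percentage points, relative gain {relative_gain:.1f}%")
```

With these made-up counts the absolute lift and the relative gain differ substantially, which is why the claim's "over 20 percentage points" should be read as an absolute difference in recall.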

C2 · weakest assumption

That the 365 injected bugs and the three difficulty strata (L1-L3) are representative of real semantic bugs that would matter in production Python libraries, and that the curation process did not inadvertently favor bugs that current LLMs happen to be good or bad at detecting.

C3 · one line summary

PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.
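To make the task shape concrete, here is a minimal sketch of the three ingredients named in the summary: an invariant read off documentation, a targeted input generator, and an injected semantic bug. It uses only the standard library `random` module as a stand-in for Hypothesis so that it runs without dependencies; in Hypothesis the same idea would be written with `@given` and strategies. The `clamp` function, its buggy variant, and the generator are hypothetical examples and are not drawn from the benchmark.

```python
import random


def clamp(x: int, lo: int, hi: int) -> int:
    """Documented contract: return x limited to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clamp_buggy(x: int, lo: int, hi: int) -> int:
    """Injected semantic bug: the upper bound is treated as exclusive."""
    if x >= hi:
        return hi - 1
    return max(lo, x)


def boundary_biased_inputs(rng: random.Random, n: int):
    """Targeted generator: most draws sit on or next to an interval boundary."""
    for _ in range(n):
        lo = rng.randint(-100, 100)
        hi = lo + rng.randint(0, 50)
        x = rng.choice([lo - 1, lo, lo + 1, hi - 1, hi, hi + 1,
                        rng.randint(-1000, 1000)])
        yield x, lo, hi


def find_counterexample(fn, cases):
    """Check two invariants derived from the docstring; return the first failing input."""
    for x, lo, hi in cases:
        out = fn(x, lo, hi)
        in_range = lo <= out <= hi                        # result stays inside [lo, hi]
        identity = out == x if lo <= x <= hi else True    # in-range inputs are unchanged
        if not (in_range and identity):
            return (x, lo, hi)
    return None


reference_result = find_counterexample(clamp, boundary_biased_inputs(random.Random(0), 200))
buggy_result = find_counterexample(clamp_buggy, boundary_biased_inputs(random.Random(0), 200))
print("reference:", reference_result)
print("buggy:", buggy_result)
```

The bug only shows when `x` equals `hi` (or when the interval is a single point), so a generator that draws `x` uniformly from a wide range would rarely trigger it. Biasing the generator toward boundaries is the kind of targeted input construction the summary refers to.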


Formal links

1 machine-checked theorem link

Receipt and verification
First computed 2026-05-20T00:00:47.409729Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

643f27b043f7bbe62cfa4e09a1d5d641d7a618f996dace6e79adef335de56b9a

Aliases

arxiv: 2605.15229 · arxiv_version: 2605.15229v1 · doi: 10.48550/arxiv.2605.15229 · pith_short_12: MQ7SPMCD6656 · pith_short_16: MQ7SPMCD6656MLH2 · pith_short_8: MQ7SPMCD
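The aliases listed for this record appear to follow a simple pattern: the short Pith forms are prefixes of the full Pith Number, and the DOI and versioned arXiv id are built from the arXiv id. The sketch below reproduces that pattern. It is inferred from this one record only, and `derive_aliases` is a hypothetical helper, not part of any Pith API.

```python
def derive_aliases(pith_number: str, arxiv_id: str, version: int) -> dict:
    """Rebuild the alias set from the pattern observed in this record (an inference)."""
    return {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{version}",
        "doi": f"10.48550/arxiv.{arxiv_id}",
        "pith_short_8": pith_number[:8],
        "pith_short_12": pith_number[:12],
        "pith_short_16": pith_number[:16],
    }


aliases = derive_aliases("MQ7SPMCD6656MLH2JYE2DVOWIH", "2605.15229", 1)
for name, value in sorted(aliases.items()):
    print(f"{name}: {value}")
```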
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/MQ7SPMCD6656MLH2JYE2DVOWIH \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 643f27b043f7bbe62cfa4e09a1d5d641d7a618f996dace6e79adef335de56b9a
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "29cb9b7070890324b6fef319ba769377b23bcf17188322794f58b62876b0df22",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.SE",
    "submitted_at": "2026-05-13T18:01:05Z",
    "title_canon_sha256": "c2ae6a7478549fdba58e03d4387701cb6e72cd4752bab4a6e85fc51c8f23d037"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15229",
    "kind": "arxiv",
    "version": 1
  }
}
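The shell pipeline above canonicalizes the record with sorted keys and compact separators before hashing it. The same step can be sketched in plain Python for offline checks. `canonical_sha256` is a hypothetical helper name; the serialization settings are copied from the one-liner, and the result should be compared against the canonical hash listed in this record rather than assumed to match.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The canonical record as printed above.
record = {
    "metadata": {
        "abstract_canon_sha256": "29cb9b7070890324b6fef319ba769377b23bcf17188322794f58b62876b0df22",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.SE",
        "submitted_at": "2026-05-13T18:01:05Z",
        "title_canon_sha256": "c2ae6a7478549fdba58e03d4387701cb6e72cd4752bab4a6e85fc51c8f23d037",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15229", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare with the value under "Canonical hash"
```

Sorting keys makes the digest independent of the key order in which a mirror happens to store the JSON, so any two parties holding the same record contents compute the same hash.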