Pith Number

pith:MQ7SPMCD

pith:2026:MQ7SPMCD6656MLH2JYE2DVOWIH

not attested · not anchored · not stored · refs resolved

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Liao Zhang, Lucas Jing, Simon S. Du, Xinqi Wang

A new benchmark shows LLMs using Hypothesis scaffolding can recall 42 to 83 percent of injected semantic bugs by turning library documentation into precise property-based tests.

arxiv:2605.15229 v1 · 2026-05-13 · cs.SE · cs.AI

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{MQ7SPMCD6656MLH2JYE2DVOWIH}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
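The idea of recomputing the same state from a bundle of events can be illustrated with a small sketch. This is not Pith's merge algorithm, which this page does not describe; the event fields (`id`, `at`, `field`, `value`) and the last-writer-wins rule are assumptions made only for illustration. The point is that deduplicating events and applying them in a total order makes the result independent of the order in which a mirror received them.

```python
import hashlib
import json


def merge_events(*event_sets):
    """Fold events into a state dict, independent of arrival order.

    Hypothetical event shape: {"id", "at", "field", "value"}.
    """
    # Deduplicate by event id so replayed events do not matter.
    by_id = {}
    for events in event_sets:
        for ev in events:
            by_id[ev["id"]] = ev
    # Apply in a total order: timestamp first, event id as tie-breaker.
    ordered = sorted(by_id.values(), key=lambda ev: (ev["at"], ev["id"]))
    state = {}
    for ev in ordered:
        state[ev["field"]] = ev["value"]  # assumed rule: last writer wins
    return state


def state_digest(state):
    """Hash the merged state in a canonical JSON form."""
    blob = json.dumps(state, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Two mirrors holding overlapping, differently ordered copies of the events.
mirror_a = [
    {"id": "e1", "at": "2026-05-20T00:00:47Z", "field": "citations", "value": "open"},
    {"id": "e2", "at": "2026-05-21T09:00:00Z", "field": "citations", "value": "closed"},
]
mirror_b = [
    {"id": "e3", "at": "2026-05-20T12:00:00Z", "field": "replications", "value": "open"},
    {"id": "e2", "at": "2026-05-21T09:00:00Z", "field": "citations", "value": "closed"},
]

digest_ab = state_digest(merge_events(mirror_a, mirror_b))
digest_ba = state_digest(merge_events(mirror_b, mirror_a))
```

Under these assumptions both digests agree, which is the property a mirror would rely on.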

Claims

C1 · strongest claim

Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation.

C2 · weakest assumption

The benchmark assumes that the 365 injected bugs and the three difficulty strata (L1-L3) are representative of real semantic bugs that would matter in production Python libraries, and that the curation process did not inadvertently favor bugs that current LLMs happen to be good or bad at detecting.

C3 · one-line summary

PBT-Bench evaluates eight LLMs on 100 property-based testing tasks requiring derivation of invariants from docs and construction of targeted input generators to reveal 365 injected semantic bugs.
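To make the task shape concrete, here is a minimal sketch of one such task: an invariant read off a docstring, a targeted input generator, and an injected semantic bug that the property exposes. None of it comes from the benchmark. The `dedupe_*` functions, the documented behaviour, and the bug are invented for illustration, and the sketch uses only the standard library's `random` module rather than Hypothesis, whose `@given` decorator and strategies would normally replace the hand-rolled generator.

```python
import random


def dedupe_reference(items):
    """Hypothetical documented behaviour: drop repeated values, keep first occurrences."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def dedupe_buggy(items):
    """Injected semantic bug: only compares each value with the previous output value."""
    out = []
    for x in items:
        if not out or out[-1] != x:
            out.append(x)
    return out


def holds_invariants(fn, items):
    """Properties derived from the docstring: no repeats, and no values lost or added."""
    result = fn(items)
    no_repeats = len(result) == len(set(result))
    same_values = set(result) == set(items)
    return no_repeats and same_values


def targeted_lists(rng, n_cases=200):
    """Targeted generator: a tiny alphabet forces non-adjacent duplicates to appear."""
    for _ in range(n_cases):
        size = rng.randint(0, 8)
        yield [rng.randint(0, 3) for _ in range(size)]


def find_counterexample(fn, rng):
    """Return the first generated input that violates the invariants, or None."""
    for items in targeted_lists(rng):
        if not holds_invariants(fn, items):
            return items
    return None


reference_failure = find_counterexample(dedupe_reference, random.Random(0))
buggy_failure = find_counterexample(dedupe_buggy, random.Random(0))
```

The generator is where most of the work is. Lists drawn from a wide integer range would rarely contain a non-adjacent duplicate, so the bug would usually go unnoticed. Restricting values to a small alphabet makes the triggering input common.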

Formal links

1 machine-checked theorem link

Receipt and verification

First computed	2026-05-20T00:00:47.409729Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

643f27b043f7bbe62cfa4e09a1d5d641d7a618f996dace6e79adef335de56b9a

Aliases

arxiv: 2605.15229 · arxiv_version: 2605.15229v1 · doi: 10.48550/arxiv.2605.15229 · pith_short_12: MQ7SPMCD6656 · pith_short_16: MQ7SPMCD6656MLH2 · pith_short_8: MQ7SPMCD
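The values on this page suggest how the identifier and its short aliases relate to the canonical hash: the 26-character Pith Number matches the start of the standard Base32 encoding of the SHA-256 digest, and each `pith_short_N` alias is its first N characters. This is an observation from the values shown here, not a documented specification, and the helper name `base32_prefix` is hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "643f27b043f7bbe62cfa4e09a1d5d641d7a618f996dace6e79adef335de56b9a"
PITH_NUMBER = "MQ7SPMCD6656MLH2JYE2DVOWIH"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Base32-encode a hex digest (RFC 4648 alphabet) and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Assumed rule: the identifier is the first 26 Base32 characters of the hash.
full_id = base32_prefix(CANONICAL_HASH, 26)

# Assumed rule: short aliases are plain prefixes of the identifier.
aliases = {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}

print(full_id == PITH_NUMBER)
print(aliases)
```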

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/MQ7SPMCD6656MLH2JYE2DVOWIH \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 643f27b043f7bbe62cfa4e09a1d5d641d7a618f996dace6e79adef335de56b9a

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "29cb9b7070890324b6fef319ba769377b23bcf17188322794f58b62876b0df22",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.SE",
    "submitted_at": "2026-05-13T18:01:05Z",
    "title_canon_sha256": "c2ae6a7478549fdba58e03d4387701cb6e72cd4752bab4a6e85fc51c8f23d037"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15229",
    "kind": "arxiv",
    "version": 1
  }
}
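The hashing step of the shell pipeline under "Verify this Pith Number yourself" can also be run offline against the record above. The sketch below mirrors the serialization used in that one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8). It assumes the JSON shown here is exactly the `canonical_record` the resolver returns, which this page does not guarantee, so compare the printed digest with the canonical hash listed earlier instead of taking a match for granted.

```python
import hashlib
import json

# Record copied from the "Canonical record JSON" section of this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "29cb9b7070890324b6fef319ba769377b23bcf17188322794f58b62876b0df22",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.SE",
        "submitted_at": "2026-05-13T18:01:05Z",
        "title_canon_sha256": "c2ae6a7478549fdba58e03d4387701cb6e72cd4752bab4a6e85fc51c8f23d037",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15229", "kind": "arxiv", "version": 1},
}


def canonical_sha256(record) -> str:
    """Serialize with sorted keys and no whitespace, then hash the UTF-8 bytes."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


digest = canonical_sha256(CANONICAL_RECORD)
print(digest)
```

Sorting keys and fixing the separators is what makes the digest reproducible: two parties that load the same record with different key order or whitespace still hash identical bytes.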