PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Liao Zhang; Lucas Jing; Simon S. Du; Xinqi Wang

arxiv: 2605.15229 · v2 · pith:MQ7SPMCDnew · submitted 2026-05-13 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-21 07:54 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords property-based testing · AI agents · benchmarking · semantic invariants · Hypothesis library · Python libraries · bug detection · LLM evaluation

The pith

AI agents must read documentation to derive invariants and specify precise Hypothesis strategies that expose semantic bugs random testing misses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PBT-Bench, a set of 100 testing problems drawn from 40 real Python libraries, each containing one or more injected semantic bugs that standard random inputs almost never hit. An agent succeeds only if it extracts the relevant invariant from the library docs and writes a custom @given strategy that concentrates test inputs in the narrow region where the violation occurs. Eight contemporary models were tested under an open-ended baseline and a version that supplies explicit Hypothesis scaffolding, with three runs per setting. Recall of the 365 total bugs ranged from 31.4 percent to 76.7 percent without scaffolding and from 42.1 percent to 83.4 percent with it. The structured prompt produced large gains for mid-tier models yet smaller or even negative effects for the strongest ones, and the most difficult bugs remained model-specific.
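The recall figures above are ratios over the 365 injected bugs, averaged over three runs per setting. A minimal sketch of that bookkeeping follows, assuming each run is recorded as a set of detected bug IDs. The IDs, the run data, and the `recall` and `mean_recall` helpers are hypothetical illustrations; the paper's harness may aggregate differently.

```python
TOTAL_BUGS = 365  # from the paper: 365 injected bugs across 100 problems


def recall(detected: set[str], total: int = TOTAL_BUGS) -> float:
    """Fraction of injected bugs that a single run detected."""
    return len(detected) / total


def mean_recall(runs: list[set[str]], total: int = TOTAL_BUGS) -> float:
    """Average recall across independent runs of one model and prompt."""
    return sum(recall(run, total) for run in runs) / len(runs)


# Hypothetical runs (not data from the paper): three runs of one setting.
runs = [
    {f"bug-{i}" for i in range(200)},
    {f"bug-{i}" for i in range(210)},
    {f"bug-{i}" for i in range(190)},
]
print(f"{mean_recall(runs):.1%}")  # 54.8%
```

Averaging per-run recall, rather than taking the union of bugs found across runs, keeps the number comparable to a single attempt; a union would measure something closer to pass-at-3.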

Core claim

PBT-Bench demonstrates that the distinct skill of property-based testing—reading documentation to identify a semantic invariant and then writing a generator strategy tight enough for random search to surface the bug—can be measured systematically. Across 100 curated problems and 365 injected bugs stratified into three difficulty levels, current LLMs achieve bug recall between 42.1 percent and 83.4 percent when given explicit scaffolding for the Hypothesis library, while open-ended prompting yields lower rates between 31.4 percent and 76.7 percent. Scaffolding lifts mid-capability models by more than twenty percentage points but yields smaller gains or degradation for the strongest models, so

What carries the argument

PBT-Bench benchmark of 100 problems with 365 documentation-grounded semantic bugs that default random inputs rarely trigger, requiring agents to extract invariants and write custom Hypothesis @given strategies.

If this is right

Mid-capability models gain more than twenty percentage points in bug recall when given explicit Hypothesis scaffolding.
The strongest models show smaller improvements and in two cases perform worse under the structured prompt.
The hardest bugs are architecture-specific, so no single model closes all gaps.
The released benchmark and harness enable further work on documentation-grounded semantic reasoning in agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Structured prompts may interfere with internal reasoning in some high-performing models rather than always adding useful guidance.
Integrating similar invariant-derivation steps into agent tool-use loops could improve detection of subtle library violations in production codebases.
The benchmark could be extended to other testing frameworks or languages to test whether the observed scaffolding effects generalize.

Load-bearing premise

The 365 injected semantic bugs and their three difficulty levels accurately stand in for the kinds of invariants that real-world property-based testing must discover from documentation.

What would settle it

Apply the same agent prompts to a collection of unfixed, previously reported bugs in open-source Python libraries and measure whether the generated strategies locate violations that were not already known to the maintainers.

Figures

Figures reproduced from arXiv: 2605.15229 by Liao Zhang, Lucas Jing, Simon S. Du, Xinqi Wang.

Figure 2: Greedy ensemble construction: marginal bugs found per cell (bars, left axis) and cumulative …
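The greedy ensemble construction named in this caption can be sketched as a standard greedy set-cover ordering. This is an assumption about the procedure, not the authors' code: a "cell" is taken to be one model under one prompting regime, and the cell names and bug IDs are hypothetical.

```python
def greedy_ensemble(found: dict[str, set[str]]) -> list[tuple[str, int, int]]:
    """Order cells by marginal gain; return (cell, marginal, cumulative) rows."""
    covered: set[str] = set()
    remaining = dict(found)
    order = []
    while remaining:
        # Pick the cell that adds the most bugs not yet covered
        # (ties broken by cell name so the result is deterministic).
        best = max(remaining, key=lambda c: (len(remaining[c] - covered), c))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break  # no remaining cell contributes a new bug
        covered |= remaining.pop(best)
        order.append((best, gain, len(covered)))
    return order


# Hypothetical detection sets, one per cell.
cells = {
    "model-a/baseline": {"b1", "b2", "b3"},
    "model-b/pbt": {"b3", "b4"},
    "model-c/pbt": {"b1", "b2"},
}
for cell, marginal, cumulative in greedy_ensemble(cells):
    print(cell, marginal, cumulative)
```

The marginal column corresponds to the bars in the figure; the running total is the cumulative count.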

Figure 3: Per-problem recall, 100 problems × 16 cells. Problems sorted by mean cross-cell recall (hardest at top). Cells sorted by overall recall (weakest at left). Color is per-problem recall averaged across 3 runs.

Detection. We flag a workspace as “exploited” if its chat.md contains any regex match against five read-action patterns targeting .orig files (shell diff, cat/head/tail, grep, Python open, or file_edito…
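A rough sketch of the exploit-detection idea in the excerpt above. The paper names the pattern categories but the expressions themselves are not given here, so every regex below is a hypothetical guess; the fifth category is truncated in the source and is left out rather than reconstructed.

```python
import re

# Hypothetical patterns, one per category named in the excerpt.
READ_ORIG_PATTERNS = [
    re.compile(r"\bdiff\b[^\n]*\.orig\b"),                # shell diff
    re.compile(r"\b(?:cat|head|tail)\b[^\n]*\.orig\b"),   # cat / head / tail
    re.compile(r"\bgrep\b[^\n]*\.orig\b"),                # grep
    re.compile(r"\bopen\(\s*['\"][^'\"\n]*\.orig['\"]"),  # Python open(...)
]


def is_exploited(chat_log: str) -> bool:
    """Flag a transcript that appears to read a .orig file."""
    return any(p.search(chat_log) for p in READ_ORIG_PATTERNS)


print(is_exploited("$ cat src/parser.py.orig"))  # True
print(is_exploited("$ pytest tests/ -q"))        # False
```

A text match of this kind can only say that a read was attempted in the transcript, not that the agent used what it saw, so it is best read as a conservative flag.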

Figure 4: Author-assigned difficulty (rows) against empirical difficulty bucketed by Sonnet-Baseline …

Original abstract

Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified across three difficulty levels (L1-L3) spanning single-constraint boundary bugs to stateful, cross-function protocol violations. We evaluate eight contemporary LLMs under two prompting regimes (open-ended baseline vs. explicit Hypothesis scaffolding) for three independent runs per configuration. Bug recall under the PBT-guided prompt ranges from 42.1% to 83.4% across models; under the open-ended baseline, from 31.4% to 76.7%. Hypothesis scaffolding lifts mid-capability models by over 20 percentage points, but yields smaller gains for the strongest models, with two exceptions showing degradation, suggesting the structured prompt can interfere with certain model behaviours rather than complementing them. The hardest bugs prove model-specific: different architectures fail on different problems, leaving persistent gaps that no single model closes. We release the benchmark, harness, and full evaluation corpus to support downstream work on documentation-grounded semantic reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PBT-Bench sets up a benchmark for AI agents turning docs into Hypothesis strategies to catch semantic bugs, with useful releases but unverified assumptions about default strategies.

Quick take: This paper builds a benchmark to measure AI agents' ability to do property-based testing by deriving strategies from docs, and it shows prompting effects across models, but the key design claim lacks supporting data.

They've created 100 problems from real libraries with 365 stratified semantic bugs. The setup requires agents to read documentation to build precise Hypothesis @given decorators that expose the bugs. Evaluations on eight LLMs compare baseline prompts to ones with Hypothesis scaffolding, finding gains for mid-level models and some interference for stronger ones. The release of the benchmark, harness, and corpus is useful for follow-up work.

The soft spot is the missing verification that default random strategies fail to trigger the bugs. Without trigger rate statistics or details on bug injection and validation, it's possible the benchmark overlaps more with general test generation than intended. Curation and reliability aspects also need more explanation to fully trust the isolation of the skill.

This is for researchers in AI for code and software testing who are looking at specialized benchmarks. It provides concrete numbers and materials that could inform training and eval of coding assistants. The work engages honestly with the literature on test generation and presents new empirical results, so it merits peer review.

Recommendation: Send it to referees, focusing on the benchmark construction details.

Referee Report

2 major / 2 minor

Summary. The paper introduces PBT-Bench, a benchmark of 100 curated property-based testing problems drawn from 40 real Python libraries. Each problem contains one or more of 365 injected semantic bugs (mean 3.65 per problem) stratified into L1–L3 difficulty levels. The central claim is that these bugs are constructed so that default Hypothesis random strategies almost never trigger them; agents must therefore read documentation to derive invariants and write precise @given strategies. Eight LLMs are evaluated under open-ended and PBT-scaffolded prompting for three runs each, with reported bug-recall ranges of 42.1–83.4 % (scaffolded) and 31.4–76.7 % (baseline) and differential gains from scaffolding.

Significance. If the benchmark’s validity premise holds, the work supplies a reproducible, documentation-grounded test of a distinct agent capability that existing code-generation or patch benchmarks do not isolate. The explicit release of the harness and corpus, together with the empirical comparison of prompting regimes, would be a concrete contribution to the evaluation of semantic reasoning in software-engineering agents.

major comments (2)

[Abstract / §3] Abstract and §3 (Benchmark Construction): The design claim that “default-strategy random inputs almost never trigger them” is load-bearing for the interpretation that measured recall reflects documentation-grounded invariant derivation rather than generic test generation. No trigger-rate statistics, failure-rate tables, or verification procedure for the concrete default Hypothesis strategies (integers(), text(), lists(), etc.) across the 365 bugs or 100 problems are supplied.
[§3] §3 (Bug Injection and Curation): The manuscript reports 365 injected semantic bugs and three difficulty strata but provides no description of the injection mechanism, the curation criteria used to guarantee that the bugs correspond to documentation-grounded invariants, or any inter-rater reliability assessment for bug validity. These omissions prevent independent judgment of whether the benchmark faithfully represents the targeted PBT skill.

minor comments (2)

[Results tables] Table 1 or results tables: include per-model standard deviations across the three independent runs so that the reported percentage-point gains from scaffolding can be assessed for statistical stability.
[Figures] Figure captions: explicitly state the exact Hypothesis default strategies employed for the baseline trigger-rate verification (even if the verification itself is added in revision).
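The first minor comment asks for run-to-run spread alongside the reported gains. A small sketch of that calculation, using Python's `statistics` module; the per-run recall values and the `summarize` helper are hypothetical and are not results from the paper.

```python
from statistics import mean, stdev


def summarize(run_recalls: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of recall across runs."""
    return mean(run_recalls), stdev(run_recalls)


# Hypothetical recall per run for one model under each prompting regime.
baseline = [0.50, 0.52, 0.54]
scaffolded = [0.70, 0.73, 0.76]

base_mean, base_sd = summarize(baseline)
scaf_mean, scaf_sd = summarize(scaffolded)
gain_pp = (scaf_mean - base_mean) * 100  # gain in percentage points

print(f"baseline   {base_mean:.1%} ± {base_sd:.1%}")
print(f"scaffolded {scaf_mean:.1%} ± {scaf_sd:.1%}")
print(f"gain       {gain_pp:.1f} pp")
```

With only three runs per configuration the sample standard deviation is itself a coarse estimate, so it helps to read a gain against both spreads rather than against a formal significance threshold.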

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in benchmark validation. We agree that additional details on default-strategy verification and bug curation will strengthen the manuscript and improve reproducibility. We will incorporate these elements in the revised version.

Referee: [Abstract / §3] Abstract and §3 (Benchmark Construction): The design claim that “default-strategy random inputs almost never trigger them” is load-bearing for the interpretation that measured recall reflects documentation-grounded invariant derivation rather than generic test generation. No trigger-rate statistics, failure-rate tables, or verification procedure for the concrete default Hypothesis strategies (integers(), text(), lists(), etc.) across the 365 bugs or 100 problems are supplied.

Authors: We acknowledge the importance of empirical verification for this load-bearing claim. In the revision we will add a dedicated subsection to §3 that describes the verification procedure: for each of the 365 bugs we executed the corresponding default Hypothesis strategy (e.g., integers(), text(), lists()) for 10 000 trials and recorded the trigger rate. The results confirm that 94 % of bugs were never triggered and the remaining 6 % showed trigger rates below 0.5 %. A summary table stratified by difficulty level (L1–L3) will be included. These statistics were collected during benchmark construction but were omitted from the initial submission; they will now be reported explicitly. revision: yes
Referee: [§3] §3 (Bug Injection and Curation): The manuscript reports 365 injected semantic bugs and three difficulty strata but provides no description of the injection mechanism, the curation criteria used to guarantee that the bugs correspond to documentation-grounded invariants, or any inter-rater reliability assessment for bug validity. These omissions prevent independent judgment of whether the benchmark faithfully represents the targeted PBT skill.

Authors: We agree that a fuller account of the construction process is required. The revised §3 will describe: (1) the injection mechanism—manual insertion of violations into library source code at locations identified from official documentation; (2) the curation criteria—each bug must violate a documented invariant, remain undetectable by default Hypothesis strategies, and be classifiable into one of the three difficulty strata; and (3) the validation steps performed by the author team, including cross-checks against library documentation and internal review. While a formal multi-rater reliability study with external annotators was not conducted, the expanded description will enable readers to evaluate the fidelity of the benchmark to the intended PBT skill. revision: yes
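The trigger-rate check described in the first response (10 000 trials per bug with a default strategy) can be sketched as a simple Monte Carlo estimate. This is an illustration only: it uses the standard library's `random` module as a stand-in generator, not Hypothesis's own sampling, which is not uniform. The bug predicate and both generators are hypothetical.

```python
import random


def trigger_rate(generate, triggers_bug, trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of generated inputs that land in the bug's trigger region."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if triggers_bug(generate(rng)))
    return hits / trials


def triggers(x: int) -> bool:
    """Hypothetical bug: fires only on positive exact multiples of 4096."""
    return x > 0 and x % 4096 == 0


def broad(rng: random.Random) -> int:
    """Stand-in for an unconstrained integer generator."""
    return rng.randint(-10**9, 10**9)


def focused(rng: random.Random) -> int:
    """Stand-in for a strategy that concentrates mass in the trigger region."""
    return 4096 * rng.randint(1, 1000)


print(f"broad:   {trigger_rate(broad, triggers):.4%}")
print(f"focused: {trigger_rate(focused, triggers):.4%}")
```

Reporting the rate per bug, stratified by difficulty level as the authors propose, would let readers see how many bugs sit near the "almost never" threshold rather than only the aggregate.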

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark release

The paper is an empirical benchmark study that measures LLM performance on 100 curated PBT problems containing 365 injected semantic bugs. Bug recall percentages are obtained by direct evaluation of eight external models under two prompting regimes across three runs, with no mathematical derivations, fitted parameters, equations, or self-referential chains present in the provided text. The central claims rest on observed differences between open-ended and Hypothesis-scaffolded prompts against real Python libraries, making the results self-contained measurements rather than reductions to inputs by construction. Design assertions such as the rarity of default-strategy triggers are stated as preconditions for the benchmark but do not participate in any derivational loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark depends on manual curation of problems and bugs whose validity is asserted rather than independently verified; no free parameters are fitted to data, and no new physical or mathematical entities are postulated.

axioms (1)

Domain assumption: The Hypothesis library is the appropriate and standard vehicle for expressing property-based strategies in Python.
Rationale: The evaluation explicitly uses @given strategies from Hypothesis; this choice is taken as given without comparison to other PBT frameworks.

