
arxiv: 2512.21132 · v2 · pith:ML26MS6Ynew · submitted 2025-12-24 · 💻 cs.CR · cs.AI · cs.LG · cs.PL

AutoBaxBuilder: Bootstrapping Code Security Benchmarking

Pith reviewed 2026-05-22 12:10 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG · cs.PL
keywords code security · benchmark generation · LLM evaluation · vulnerability detection · automated testing · security exploits

The pith

AutoBaxBuilder generates new code security benchmarks from scratch using LLMs and reliability checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AutoBaxBuilder as an automated pipeline to create code security benchmarking tasks without relying on heavy manual expert work. It tackles the problems of benchmarks contaminating LLM training data, the need for new tasks, and the requirement to raise difficulty as models improve. The approach combines LLMs' code-understanding abilities with reliability checks to produce functional tests and end-to-end security-probing exploits. Quality is checked by matching predictions to an expert baseline and by manual soundness review, with the result that full tasks can be made in under two hours for under four dollars and overall human effort drops by a factor of twelve.

Core claim

AutoBaxBuilder is an automated pipeline that leverages the code-understanding capabilities of LLMs combined with robust reliability checks to construct functional tests and end-to-end security-probing exploits for code security benchmarks.

What carries the argument

LLM code-understanding combined with reliability checks that verify both functional correctness and security properties of generated tests and exploits.
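
As a rough illustration of how such a reliability check could drive refinement, the sketch below keeps revising an exploit until it succeeds against a deliberately weakened solution and fails against a hardened one. The names `refine_exploit`, `run_exploit`, and `revise`, the string-based types, and the iteration budget are all hypothetical and not taken from the paper.

```python
from typing import Callable, Optional

# Hypothetical stand-ins: an exploit is source text, a solution is an opaque handle.
Exploit = str
Solution = str


def refine_exploit(
    exploit: Exploit,
    weakened: Solution,
    hardened: Solution,
    run_exploit: Callable[[Exploit, Solution], bool],
    revise: Callable[[Exploit, bool, bool], Exploit],
    max_iters: int = 5,  # assumed budget, not a value from the paper
) -> Optional[Exploit]:
    """Return an exploit that separates the two solutions, or None if the budget runs out."""
    for _ in range(max_iters):
        hits_weak = run_exploit(exploit, weakened)
        hits_hard = run_exploit(exploit, hardened)
        # Accept only if the exploit fires on the weakened solution
        # and stays silent on the hardened one.
        if hits_weak and not hits_hard:
            return exploit
        # Otherwise hand both observations to the reviser (an LLM call in practice).
        exploit = revise(exploit, hits_weak, hits_hard)
    return None
```

Returning `None` after a fixed budget, rather than looping indefinitely, is one way to let a pipeline drop exploit strategies that never become reliable.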

If this is right

  • New security tasks can be added rapidly as LLMs advance without a matching rise in expert hours.
  • Fresh benchmarks reduce the chance that evaluation data has already entered training sets.
  • Difficulty levels can be scaled up systematically to keep challenging stronger models.
  • A public benchmark such as AutoBaxBench becomes feasible to maintain and expand over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generation pattern could be tested on other software properties such as performance or maintainability.
  • Repeated runs of the pipeline might support live benchmark suites that update automatically with new model releases.
  • Extending the reliability checks to additional programming languages would widen the range of usable benchmarks.

Load-bearing premise

The combination of LLM understanding and the pipeline's checks produces benchmarks whose security properties match those an expert would create, without systematic gaps or false positives.

What would settle it

An expert review of a set of generated tasks that reveals consistent differences in the security vulnerabilities identified compared with the pipeline's output.

Figures

Figures reproduced from arXiv: 2512.21132 by Mark Vero, Martin Vechev, Maximilian Baader, Niels Mündler, Tobias von Arx.

Figure 1: Overview of our method. The LLM-based pipeline starts from scratch and produces a …
Figure 2: Flag system for RefineExploit. For each exploit strategy, the orchestration LLM generates a security test that implements the exploit. Similarly to the functionality tests, we now want to ensure that the exploits are both able to expose real vulnerabilities and not falsely report non-existing vulnerabilities. The process is outlined in …
Figure 3: LLM performance comparison on scenarios from …
Figure 4: Confusion matrix on pass@1 between BAXBENCH and AUTOBAXBENCH, showing high correlation. We investigate the relationship manually and find that AUTOBAXBUILDER produces overall more thorough tests covering a wider range of security vulnerabilities, as detailed below. High agreement in functional correctness: We compare the agreement between the functional tests in BAXBENCH and generated by AUTOBAXBUILDER gr…
Figure 6: LLM performance on AUTOBAXBENCH, sorted by highest overall sec_pass@1 and split by subset, AUTOBAXBENCH EASY, AUTOBAXBENCH MEDIUM and AUTOBAXBENCH HARD. … with significant disagreement. We manually inspect these cases and discover two incorrect test cases in BAXBENCH, and one ambiguous task specification. For our evaluation, we have corrected the two wrong functional tests and raised an issue with the BAXBEN…
Figure 5: Confusion matrix on sec_pass@1 between BAXBENCH and AUTOBAXBENCH. Inspecting the confusion matrix for individual instances in Figure 5, we observe that AUTOBAXBENCH's exploits are very thorough, finding a security vulnerability in 78% of instances marked as insecure in BAXBENCH. In addition, it marks 33% of instances as insecure, where BAXBENCH does not find a successful exploit. Overall, we conclude that …
Figure 7: Most tokens are spent on test and exploit refinement. The main time and cost spent in the pipeline is spent on output token generation. As shown in …
Figure 8: LLM performance comparison on 3 generated scenarios of easy difficulty, using …
Figure 10: Effect of including CWE-400 on LLM performance on scenarios from …
Figure 11: Effect of including CWE-400 on LLM performance on …
Figure 9: Confusion matrix on sec_pass@1 between BAXBENCH and AUTOBAXBENCH, including CWE-400. Our manual analysis reveals that exploits raising CWE-400 (Uncontrolled Resource Consumption) are often unreliable and have a high chance of false positives. The reason is that, as outlined in App. A.4 and App. A.5, CWE-400 requires detecting excessive memory usage, for which the cutoff for excessive memory usage is not c…
Figure 12: Textual description of the SVGBadge scenario generated by GPT-5.
Figure 13: OpenAPI description of the SVGBadge scenario generated by GPT-5.
Figure 14: Initial solution for the SVGBadge scenario generated by C…
Figure 15: Functional requirements for SVGBadge scenario, derived from the specification in …
Figure 16: Initial test code for the first functional requirement in Figure …
Figure 17: First solution refinement iteration step for the …
Figure 18: Adapted variant of the test in Figure …
Figure 19: Exploit strategy for XSS.
Figure 20: Initial exploit for XSS.
Figure 21: The reference solution for SVGBadge by QWEN3 CODER 480B (top) correctly runs html.escape on all user inputs and thus prevents an XSS vulnerability. This is correctly noticed by the orchestration LLM when running its generated exploit against the solution (bottom).
Figure 22: The reference solution for SVGBadge by QWEN3 CODER 480B is modified to be vulnerable against XSS (top) correctly. The exploit code now reports a success, which is correctly noticed by the orchestration LLM (bottom).
Figure 23: Abbreviated exploit code for an attempted Uncontrolled Resource Consumption exploit.
Figure 24: Excerpts of a reference solution used during exploit generation that provides an unspecified …
Figure 25: Excerpts of a reference solution used during exploit generation that provides an unspecified …
Figure 26: Misleading global security definition at the end of the Login scenario OpenAPI specifica…
Figure 27: Correlations of pass@1 scores of task instances aggregated by scenario. Most scenarios show significant correlation, evidencing high functional alignment.
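
The captions for Figures 19 to 22 revolve around an XSS exploit against the SVGBadge scenario, where the hardened reference solution runs `html.escape` on user inputs. A minimal sketch of that idea follows, using a toy renderer instead of the real service. The function names `render_badge` and `attribute_breakout`, the SVG template, the marker value, and the payload are illustrative assumptions rather than the paper's code.

```python
import html

MARKER = "MARKER_1234"  # hypothetical unique marker used to recognise the injection
# Illustrative attribute-breakout payload: closes the attribute and adds `onload`.
PAYLOAD = f'X" onload="/*{MARKER}*/" x="'


def render_badge(label: str, escape: bool) -> str:
    """Toy SVG renderer; escape=True mimics a hardened solution."""
    if escape:
        # html.escape uses quote=True by default, so '"' becomes '&quot;'.
        label = html.escape(label)
    return f'<svg aria-label="{label}"><text>{label}</text></svg>'


def attribute_breakout(svg: str) -> bool:
    """True if the payload escaped its attribute and injected a raw onload handler."""
    return f' onload="/*{MARKER}*/"' in svg


vulnerable_svg = render_badge(PAYLOAD, escape=False)
hardened_svg = render_badge(PAYLOAD, escape=True)
print(attribute_breakout(vulnerable_svg), attribute_breakout(hardened_svg))
```

Checking for the raw `onload="..."` sequence, instead of merely for the marker text, matters here: the escaped output still contains the marker, but only as inert attribute content.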
Original abstract

As large language models (LLMs) see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work showed that LLMs are prone to generating code with security vulnerabilities, highlighting that security is often overlooked. These insights were enabled by specialized benchmarks crafted by security experts through significant manual effort. However, benchmarks (i) inevitably end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, an automated pipeline that generates code security benchmarking tasks from scratch. It leverages the code-understanding capabilities of LLMs combined with robust reliability checks to construct functional tests and end-to-end security-probing exploits. The quality of the pipeline is quantitatively confirmed by aligning its predictions with an expert-written baseline and qualitatively validated through manual soundness verification. We use AutoBaxBuilder to construct a new benchmark and release it to the public as AutoBaxBench, together with a thorough evaluation on contemporary LLMs. AutoBaxBuilder generates new tasks in under 2 hours, for less than USD 4. Including a manual verification, this reduces the required human effort for benchmark construction by a factor of 12.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AutoBaxBuilder, an automated pipeline that leverages LLMs' code-understanding capabilities together with reliability checks to generate code security benchmarking tasks from scratch, including functional tests and end-to-end security-probing exploits. It claims that new tasks can be produced in under 2 hours for less than USD 4, reducing required human effort by a factor of 12 when manual verification is included, with quality quantitatively confirmed via alignment to an expert-written baseline and qualitatively validated by manual soundness checks. The authors release the resulting AutoBaxBench and report evaluations on contemporary LLMs.

Significance. If the generated tasks prove equivalent in security properties to expert-crafted ones, the work would meaningfully lower the cost and time of producing fresh, uncontaminated benchmarks, directly addressing data contamination, the need for increasing task difficulty, and the requirement to extend coverage as LLMs advance.

major comments (2)
  1. [§5] §5 (Validation and Evaluation): The abstract and validation section state that quality is confirmed by alignment with an expert-written baseline and manual soundness verification, yet no details are provided on the precise reliability checks applied, observed failure modes of the LLM generator, or the impact of post-generation filtering on the final task distribution. This omission is load-bearing because undetected gaps in vulnerability coverage or acceptance of spurious exploits would directly undermine the equivalence claim and the reported effort reduction.
  2. [§3.2] §3.2 (Exploit Construction): The pipeline description does not specify the concrete criteria or test oracles used to verify that generated end-to-end exploits correctly target the intended vulnerabilities without introducing false positives that pass internal checks but would fail expert review. Given that benchmark utility rests on accurate vulnerability labeling, this gap affects the central soundness claim.
minor comments (2)
  1. [Abstract] The factor-of-12 effort reduction is asserted without an explicit breakdown (e.g., expert hours per task versus pipeline time plus verification) in the abstract or early sections; adding a short quantitative table or sentence would improve clarity.
  2. [Figures/Tables] Figure captions and table headers could more explicitly indicate the number of tasks and LLMs evaluated to allow readers to assess statistical power without cross-referencing the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§5] §5 (Validation and Evaluation): The abstract and validation section state that quality is confirmed by alignment with an expert-written baseline and manual soundness verification, yet no details are provided on the precise reliability checks applied, observed failure modes of the LLM generator, or the impact of post-generation filtering on the final task distribution. This omission is load-bearing because undetected gaps in vulnerability coverage or acceptance of spurious exploits would directly undermine the equivalence claim and the reported effort reduction.

    Authors: We agree that additional detail on the reliability checks is warranted to fully support the claims. Section 3 describes the core checks (execution of functional tests, differential exploit success on vulnerable vs. patched code, and LLM consistency verification), and Section 5 reports aggregate alignment with the expert baseline. However, we acknowledge that explicit enumeration of failure modes (e.g., non-reproducible exploits or partial vulnerability coverage) and quantitative filtering statistics (pre- vs. post-filter task distributions) are not presented at the level of granularity requested. We will revise Section 5 to include a new subsection with these details, including failure-mode examples, rejection rates per vulnerability category, and the resulting effect on benchmark composition. This will strengthen the equivalence argument without altering the reported effort-reduction factor. revision: yes

  2. Referee: [§3.2] §3.2 (Exploit Construction): The pipeline description does not specify the concrete criteria or test oracles used to verify that generated end-to-end exploits correctly target the intended vulnerabilities without introducing false positives that pass internal checks but would fail expert review. Given that benchmark utility rests on accurate vulnerability labeling, this gap affects the central soundness claim.

    Authors: We appreciate the referee highlighting the need for explicit verification criteria. The current text in Section 3.2 outlines the LLM-driven exploit generation followed by execution-based validation, but does not enumerate the precise oracles. We will expand this section to specify the concrete criteria: (i) the exploit must trigger the target vulnerability (e.g., via observable memory corruption or incorrect output) on the vulnerable implementation, (ii) the same exploit must fail to trigger the vulnerability on the corresponding patched implementation, and (iii) results must be consistent across multiple runs to exclude flakiness. We will also describe how these oracles are implemented in the reliability-check stage and provide illustrative examples. These additions will directly address the soundness concern while preserving the automated nature of the pipeline. revision: yes
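
The three criteria named in this response can be read as a small test oracle. The following is a minimal sketch under that reading, assuming each exploit run reduces to a boolean "vulnerability triggered" signal. `OracleResult`, `check_exploit`, and the default of three runs are hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OracleResult:
    triggers_on_vulnerable: bool  # criterion (i)
    blocked_on_patched: bool      # criterion (ii)
    consistent: bool              # criterion (iii)

    @property
    def accepted(self) -> bool:
        return (self.triggers_on_vulnerable
                and self.blocked_on_patched
                and self.consistent)


def check_exploit(
    run_on_vulnerable: Callable[[], bool],
    run_on_patched: Callable[[], bool],
    runs: int = 3,  # assumed repetition count
) -> OracleResult:
    """Run the exploit several times against both implementations and summarise."""
    vulnerable: List[bool] = [run_on_vulnerable() for _ in range(runs)]
    patched: List[bool] = [run_on_patched() for _ in range(runs)]
    return OracleResult(
        triggers_on_vulnerable=all(vulnerable),
        blocked_on_patched=not any(patched),
        # Flaky exploits produce mixed outcomes across identical runs.
        consistent=len(set(vulnerable)) == 1 and len(set(patched)) == 1,
    )
```

Keeping the three flags separate, instead of returning a single boolean, makes it easier to report why an exploit was rejected, for instance to distinguish a flaky exploit from a false positive on the patched code.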

Circularity Check

0 steps flagged

No significant circularity; external expert baseline provides independent grounding

Full rationale

The paper describes an LLM-based pipeline (AutoBaxBuilder) that generates functional tests and security exploits, with quality assessed via quantitative alignment to a separately authored expert-written baseline plus manual soundness checks. This validation step is external to the pipeline's own outputs and does not reduce any central claim to a self-defined or fitted quantity. No equations, self-citations, or ansatzes are presented that would force predictions to equal inputs by construction. The derivation therefore remains self-contained against external benchmarks, consistent with the most common honest finding for such work.
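
To make the notion of alignment with an expert baseline concrete, the sketch below computes two agreement rates from per-instance security verdicts. It is an assumption-laden illustration: the function name `agreement`, the choice of denominators, and the toy data are not from the paper, and the numbers are not the paper's results.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def agreement(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, Optional[float]]:
    """Each pair is (insecure_per_baseline, insecure_per_generated) for one instance."""
    counts = Counter(pairs)
    base_insecure = counts[(True, True)] + counts[(True, False)]
    base_secure = counts[(False, True)] + counts[(False, False)]
    return {
        # Share of baseline-insecure instances that the generated exploits also flag.
        "confirmed": counts[(True, True)] / base_insecure if base_insecure else None,
        # Share of baseline-secure instances that the generated exploits flag anyway.
        "newly_flagged": counts[(False, True)] / base_secure if base_secure else None,
    }


# Toy verdicts for eight instances (made up for illustration).
toy = [(True, True)] * 3 + [(True, False)] + [(False, True)] + [(False, False)] * 3
print(agreement(toy))
```

A high "confirmed" rate alone does not establish soundness, since newly flagged instances still need manual inspection to tell more thorough exploits apart from false positives.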

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that current LLMs possess reliable code-understanding capabilities sufficient for generating sound security exploits and tests when combined with the described checks.

axioms (2)
  • domain assumption LLMs have sufficient code-understanding capabilities to generate functional tests and security-probing exploits when guided by the pipeline
    Invoked to justify the core generation step in the abstract.
  • domain assumption The reliability checks in the pipeline are robust enough to filter out unsound tasks without expert intervention on every item
    Required for the claim that the output aligns with expert quality.

pith-pipeline@v0.9.0 · 5788 in / 1359 out tokens · 49998 ms · 2026-05-22T12:10:52.033992+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    The orchestration LLM is first prompted to perform a requirement analysis on the task, to identify relevant usage patterns and required application behaviors... iteratively refining tests and solutions in two phases: first, we iteratively refine the solutions in a solution iteration phase... In the third and final step, the M is instructed to analyse both the scenario and the solutions for vulnerabilities... until the exploit succeeds on the weakened and fails on the hardened solution.

  • IndisputableMonolith.Foundation.RealityFromDistinction reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    We validate the test and exploit generation accuracy of our pipeline by first comparing the tests and exploits generated by AUTOBAXBUILDER against the original ones in BAXBENCH, written by security experts, on the same scenarios.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
