arxiv: 2509.22097 · v4 · submitted 2025-09-26 · 💻 cs.SE · cs.AI · cs.CL · cs.CR

SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

Pith reviewed 2026-05-18 13:04 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.CR
keywords secure coding · AI code agents · vulnerability benchmarks · OSS-Fuzz · multi-file code edits · C/C++ security · static and dynamic analysis

The pith

Even the best-performing AI code agent produces code that is both correct and secure in only 23.8 percent of realistic multi-file scenarios drawn from actual open-source vulnerabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SecureVibeBench, a collection of 105 C/C++ coding tasks reconstructed from real vulnerabilities in 41 OSS-Fuzz projects. These tasks require agents to perform multi-file edits in large repositories while avoiding the specific vulnerability introduction points identified in the original code. Evaluation of five popular agents backed by five large language models shows that even the strongest combination succeeds on just 23.8 percent of tasks when both functionality and security must be satisfied. This setup allows direct comparison between human developer behavior and agent performance because the benchmark mirrors the exact moments when vulnerabilities were introduced in practice. Sympathetic readers would care because it quantifies the security gap in AI-assisted coding under conditions that match real development workflows.

Core claim

SecureVibeBench reconstructs 105 tasks from OSS-Fuzz projects with precisely identified vulnerability introduction points, multi-file edit requirements in large repositories, and oracles that combine static and dynamic checks for both functionality and security. When five code agents supported by models such as Claude Sonnet 4.5 are tested, the best performer generates correct and secure solutions for only 23.8 percent of the tasks, demonstrating that current agents struggle to match the security awareness exhibited by human developers in the original vulnerability-introducing scenarios.
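
To make the headline metric concrete, here is a minimal sketch of how a joint correct-and-secure rate could be computed. The `TaskResult` schema, the field names, and the rule that a solution must pass the static and the dynamic oracle together are assumptions made for illustration; the paper's own harness may combine the oracles differently. If the rate is taken over all 105 tasks, 23.8 percent corresponds to 25 tasks (25 / 105 ≈ 0.238).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one agent attempt on one task (hypothetical schema)."""
    task_id: str
    passes_functional_tests: bool  # functionality oracle
    passes_static_check: bool      # static security oracle
    passes_dynamic_check: bool     # dynamic security oracle


def is_correct_and_secure(result: TaskResult) -> bool:
    # Assumption: a solution counts only when every oracle accepts it.
    return (
        result.passes_functional_tests
        and result.passes_static_check
        and result.passes_dynamic_check
    )


def correct_and_secure_rate(results: list[TaskResult]) -> float:
    """Fraction of attempts that are both functionally correct and secure."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_correct_and_secure(r) for r in results) / len(results)


# Placeholder outcomes, not data from the paper.
results = [
    TaskResult("t1", True, True, True),    # correct and secure
    TaskResult("t2", True, False, True),   # correct, flagged by static oracle
    TaskResult("t3", True, True, False),   # correct, flagged by dynamic oracle
    TaskResult("t4", False, True, True),   # secure but functionally wrong
]
rate = correct_and_secure_rate(results)
print(f"{rate:.1%}")  # 25.0%

# Arithmetic behind the headline figure, assuming all 105 tasks are scored.
headline_percent = round(25 / 105 * 100, 1)
print(headline_percent)  # 23.8
```

The conjunction is the strict reading of "both correct and secure": a functionally correct patch that trips either security oracle does not count.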

What carries the argument

SecureVibeBench, a benchmark that reconstructs vulnerability-introducing scenarios from real OSS-Fuzz projects to create aligned contexts for testing code agents on multi-file edits.

If this is right

  • Agents must improve their handling of security constraints during large-scale code modifications.
  • Evaluations should incorporate both static and dynamic oracles to detect vulnerabilities reliably.
  • Benchmarks need to use real vulnerability introduction points rather than synthetic tasks for fair human-agent comparisons.
  • Development workflows using AI agents require additional security review steps given the observed failure rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers might benefit from hybrid workflows where agents propose changes but humans verify security properties.
  • Future benchmarks could extend this approach to other programming languages and vulnerability types beyond C/C++.
  • Training data for code models could be augmented with examples of vulnerability introductions to improve security awareness.

Load-bearing premise

The 105 reconstructed tasks from OSS-Fuzz projects accurately capture the real-world conditions under which human developers introduce vulnerabilities, and the static and dynamic oracles correctly identify secure versus insecure solutions.

What would settle it

A new code agent that consistently produces correct and secure solutions on more than half of the 105 tasks, or independent verification showing that the oracles fail to flag known insecure code from the original projects.
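
The second test can be phrased as a simple sanity check: for each task, an oracle should flag the original vulnerable commit and stay silent on the patched one. The sketch below assumes a hypothetical oracle interface (a function from a commit identifier to a boolean) and uses a toy stand-in oracle with made-up commit names; it is not the paper's tooling.

```python
from typing import Callable

# Hypothetical oracle interface: returns True when the code at the given
# commit is flagged as insecure.
Oracle = Callable[[str], bool]


def oracle_validation_rate(
    commit_pairs: list[tuple[str, str]], flags_insecure: Oracle
) -> float:
    """Fraction of (vulnerable, patched) pairs the oracle classifies correctly."""
    if not commit_pairs:
        raise ValueError("no commit pairs to validate")
    correct = 0
    for vulnerable, patched in commit_pairs:
        # Correct means: vulnerable commit flagged, patched commit not flagged.
        if flags_insecure(vulnerable) and not flags_insecure(patched):
            correct += 1
    return correct / len(commit_pairs)


# Toy stand-in oracle with made-up commit names (illustration only).
known_insecure = {"vuln-a", "vuln-b"}


def toy_oracle(commit: str) -> bool:
    return commit in known_insecure


pairs = [("vuln-a", "fix-a"), ("vuln-b", "fix-b"), ("vuln-c", "fix-c")]
validation_rate = oracle_validation_rate(pairs, toy_oracle)
print(f"{validation_rate:.2f}")  # 0.67
```

A rate well below 1.0 on the ground-truth commits would indicate that the oracles miss known insecure code, which is the failure mode described above.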

Original abstract

Large language model-powered code agents are rapidly transforming software engineering, yet the security risks of their generated code have become a critical concern. Existing benchmarks have provided valuable insights, but they fail to capture scenarios in which vulnerabilities are actually introduced by human developers, making fair comparisons between humans and agents infeasible. We therefore introduce SecureVibeBench, a benchmark of 105 C/C++ secure coding tasks sourced from 41 projects in OSS-Fuzz for code agents. SecureVibeBench has the following features: (i) realistic task settings that require multi-file edits in large repositories, (ii) aligned contexts based on real-world open-source vulnerabilities with precisely identified vulnerability introduction points, and (iii) comprehensive evaluation that combines functionality testing and security checking with both static and dynamic oracles. We evaluate 5 popular code agents like OpenHands, supported by 5 LLMs (e.g., Claude sonnet 4.5) on SecureVibeBench. Results show that current agents struggle to produce both correct and secure code, as even the best-performing one produces merely 23.8% correct and secure solutions on SecureVibeBench. Our code and data are on https://github.com/iCSawyer/SecureVibeBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SecureVibeBench, a benchmark of 105 C/C++ tasks reconstructed from 41 OSS-Fuzz projects. Tasks are aligned to real vulnerability-introduction points, require multi-file edits in large repositories, and are evaluated using functionality tests plus static and dynamic security oracles. Evaluation of five agents (e.g., OpenHands) backed by five LLMs shows that even the strongest agent produces solutions that are both correct and secure for only 23.8% of tasks.

Significance. If the oracles and task reconstruction hold, the benchmark supplies a more realistic testbed than prior synthetic or single-file suites by grounding evaluation in actual human-introduced vulnerabilities and large-repository contexts. The combination of static/dynamic oracles and the public release of code and data are concrete strengths that would allow the community to reproduce and extend the 23.8% headline result.

major comments (2)
  1. [Evaluation Methodology] The security-oracle description (evaluation section) reports use of static analyzers and dynamic tests but provides no explicit validation that the oracles correctly classify the original vulnerable commit as insecure and the patched commit as secure across all 105 tasks. Without such a check, the 23.8% correct-and-secure figure rests on an unverified assumption about oracle precision and coverage.
  2. [Benchmark Construction] Task-construction details (benchmark-construction section) describe sourcing from OSS-Fuzz and identification of vulnerability-introduction points, yet omit quantitative reporting on filtering criteria, inter-rater agreement, or coverage of vulnerability classes. This directly affects whether the 105 tasks support the claim that agents struggle in representative real-world settings.
minor comments (2)
  1. [Abstract] The abstract lists '5 popular code agents like OpenHands, supported by 5 LLMs (e.g., Claude sonnet 4.5)' without enumerating the exact set; a table or explicit list would improve reproducibility.
  2. [Results] Results would benefit from a per-agent, per-LLM breakdown table rather than a single aggregate 23.8% figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your constructive review and for recognizing the strengths of SecureVibeBench in providing a realistic, multi-file benchmark grounded in real OSS-Fuzz vulnerabilities. We address each major comment below and describe the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Evaluation Methodology] The security-oracle description (evaluation section) reports use of static analyzers and dynamic tests but provides no explicit validation that the oracles correctly classify the original vulnerable commit as insecure and the patched commit as secure across all 105 tasks. Without such a check, the 23.8% correct-and-secure figure rests on an unverified assumption about oracle precision and coverage.

    Authors: We agree that an explicit validation of the oracles against the ground-truth vulnerable and patched commits is necessary to support the reliability of the 23.8% result. The current manuscript describes the oracles but does not report such a systematic check. In the revision we will add a dedicated validation subsection. We will run the static (e.g., CodeQL, Infer) and dynamic oracles on the vulnerable and patched versions for the full set of 105 tasks and report the fraction of tasks where the oracles correctly flag the vulnerable commit as insecure and the patched commit as secure. Any discrepancies or limitations will be discussed. This addition directly addresses the concern.

    revision: yes

  2. Referee: [Benchmark Construction] Task-construction details (benchmark-construction section) describe sourcing from OSS-Fuzz and identification of vulnerability-introduction points, yet omit quantitative reporting on filtering criteria, inter-rater agreement, or coverage of vulnerability classes. This directly affects whether the 105 tasks support the claim that agents struggle in representative real-world settings.

    Authors: We acknowledge that the benchmark-construction section would benefit from quantitative details to substantiate representativeness. In the revised manuscript we will expand this section with: (i) explicit filtering criteria and counts (e.g., number of OSS-Fuzz projects initially considered, number excluded for language, size, or reproducibility reasons, yielding the final 41 projects and 105 tasks); (ii) a breakdown of vulnerability classes covered, including the distribution of CWE categories; and (iii) inter-rater agreement statistics for the identification of vulnerability-introduction points, which was performed by two authors with security expertise on overlapping subsets, with disagreements resolved by discussion. These additions will be presented in tables or figures and will strengthen the claim that the tasks reflect real-world settings.

    revision: yes
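
For readers unfamiliar with inter-rater agreement, one common statistic is Cohen's kappa, which corrects raw agreement between two raters for the agreement expected by chance. The rebuttal does not say which statistic the authors would report, so the choice of kappa is an assumption here, and the labels below are made up: each entry stands for one rater's yes/no judgment on whether a candidate commit is the vulnerability-introduction point.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Made-up judgments for eight candidate commits (illustration only).
rater_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater_b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
kappa = cohens_kappa(rater_a, rater_b)
print(kappa)  # 0.75
```

In this toy case the raters agree on 7 of 8 items (0.875 observed) against 0.5 expected by chance, giving a kappa of 0.75.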

Circularity Check

0 steps flagged

No significant circularity; benchmark and results derive from external OSS-Fuzz data and independent oracles.

Full rationale

The paper constructs its 105 tasks by sourcing from OSS-Fuzz projects, identifying vulnerability introduction points, and applying static plus dynamic oracles for security alongside functionality tests. The headline result (best agent at 23.8% correct-and-secure) is obtained by running external agents on these tasks. No step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the derivation chain remains self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that OSS-Fuzz-derived tasks and their associated oracles constitute a faithful proxy for real developer-introduced vulnerabilities; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption: OSS-Fuzz projects and their reported vulnerabilities provide representative real-world scenarios of vulnerability introduction by human developers.
    Used to select the 41 projects and 105 tasks with precisely identified introduction points.
  • domain assumption: The combination of static and dynamic oracles correctly classifies code as secure or insecure for the purposes of this benchmark.
    Invoked in the comprehensive evaluation description.



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ExploitGym: Can AI Agents Turn Security Vulnerabilities into Real Attacks?

    cs.CR 2026-05 conditional novelty 7.0

    ExploitGym benchmark shows frontier AI models can generate working exploits for 120-157 of 898 real vulnerabilities, with non-trivial success even when common security defenses are enabled.

  2. AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits

    cs.SE 2026-04 conditional novelty 7.0

    AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.

  3. Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs

    cs.SE 2026-04 unverdicted novelty 6.0

    A taxonomy and benchmark for logging security issues shows LLMs achieve 13-53% detection accuracy but struggle to produce correct repairs, with issue descriptions helping more than pattern explanations.
