CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

Chuan Xiao; Jingyuan Zhang; Qingfu Zhu; Rain Huang; Shiqi Zhou; Wanxiang Che; Wencong Zeng; Xianzhen Luo; Xing Yue; Yang Yue

arxiv: 2602.03012 · v2 · pith:YZX4EWAFnew · submitted 2026-02-03 · 💻 cs.CR · cs.AI


Xianzhen Luo, Jingyuan Zhang, Shiqi Zhou, Rain Huang, Chuan Xiao, Qingfu Zhu, Zhiyuan Ma, Xing Yue, Yang Yue, Wencong Zeng, Wanxiang Che


Pith reviewed 2026-05-21 14:38 UTC · model grok-4.3


keywords: CVE, vulnerability reproduction, multi-agent framework, code security, executable tasks, benchmark construction, fine-tuning, agentic evaluation


The pith

CVE-Factory, a multi-agent framework, converts CVE metadata into executable security tasks with 95% solution correctness against expert reproductions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CVE-Factory as a multi-agent system that turns limited CVE metadata into complete, runnable agentic tasks for testing code security capabilities. This automation tackles the expense and slow pace of manual expert reproductions while keeping tasks current with emerging threats. Validation through direct comparison with human reproductions confirms 95% solution correctness and 96% environment fidelity. The method then supports a live benchmark of 190 tasks across 14 languages and more than 1,000 training environments. Fine-tuning on those environments raises model performance from 5.3% to 35.8% on the new benchmark.

Core claim

CVE-Factory is the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows 95% solution correctness and 96% environment fidelity. On the latest realistic vulnerabilities the system reaches 66.2% verified success. This automation produces LiveCVEBench, a continuously updated set of 190 tasks spanning 14 languages and 153 repositories, together with over 1,000 synthesized training environments. Fine-tuning Qwen3-32B on the resulting data lifts accuracy from 5.3% to 35.8% on LiveCVEBench while surpassing Claude 4.5 Sonnet and generalizing to Terminal Bench (12.5% to 31.3%).

What carries the argument

The multi-agent pipeline that converts CVE metadata into fully executable environments and agentic tasks

If this is right

LiveCVEBench supplies a continuously updated collection of 190 executable tasks across 14 languages and 153 repositories that includes emerging AI-tooling vulnerabilities.
Over 1,000 executable training environments are produced at scale for the first time in code security.
Fine-tuned Qwen3-32B reaches 35.8% on LiveCVEBench, exceeding Claude 4.5 Sonnet.
Performance gains transfer to Terminal Bench, rising from 12.5% to 31.3%.
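The size of these gains is easier to judge as absolute and relative changes. The sketch below is only arithmetic on the reported percentages; the helper names are illustrative, and the implied task counts assume each LiveCVEBench percentage was computed over all 190 tasks, which the source does not state.

```python
# Reported figures from the summary above; nothing here is new data.
BENCH_TASKS = 190  # LiveCVEBench size as reported


def gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative multiple)."""
    return round(after_pct - before_pct, 1), round(after_pct / before_pct, 2)


def implied_solved(pct: float, total: int) -> int:
    """Nearest whole number of tasks consistent with a reported percentage.

    Assumption: the percentage was computed over all `total` tasks.
    """
    return round(pct / 100 * total)


live_gain = gain(5.3, 35.8)    # LiveCVEBench, before and after fine-tuning
term_gain = gain(12.5, 31.3)   # Terminal Bench, before and after fine-tuning

print("LiveCVEBench gain (points, multiple):", live_gain)
print("Terminal Bench gain (points, multiple):", term_gain)
print("Implied solved tasks:",
      implied_solved(5.3, BENCH_TASKS), "->", implied_solved(35.8, BENCH_TASKS))
```

Under that assumption the LiveCVEBench gain is 30.5 points (about 6.75 times the baseline), corresponding to roughly 10 versus 68 solved tasks, while the Terminal Bench gain is 18.8 points (about 2.5 times).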

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could ingest newly disclosed CVEs in near real time to keep benchmarks and training sets current without repeated manual effort.
Focusing generation on AI-tooling vulnerabilities creates a natural testbed for studying security risks introduced by code-generation agents themselves.
Open release of the framework, benchmark, and dataset allows direct extension to additional languages or integration with other agent evaluation suites.

Load-bearing premise

The automated multi-agent process produces tasks whose difficulty, realism, and lack of bias match those created manually by security experts without systematic artifacts.

What would settle it

Independent human experts reproduce a random sample of the generated tasks and achieve solution success rates significantly below 95% or report consistent differences in setup fidelity.
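One way to make "significantly below 95%" concrete is an exact one-sided binomial test. The sketch below is an assumption about how such a check could be run, not a protocol from the paper; the function names, the 0.05 significance level, and the example counts are all hypothetical.

```python
from math import comb


def binom_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def contradicts_claim(successes: int, sample: int,
                      claimed: float = 0.95, alpha: float = 0.05) -> bool:
    """True if observing this few successes would be unlikely (< alpha)
    when the true success rate equals the claimed rate."""
    return binom_lower_tail(successes, sample, claimed) < alpha


# Hypothetical outcomes of an independent re-check of 40 sampled tasks.
for ok in (37, 34):
    p_value = binom_lower_tail(ok, 40, 0.95)
    print(f"{ok}/40 correct: p = {p_value:.4f}, "
          f"contradicts 95% claim: {contradicts_claim(ok, 40)}")
```

For instance, under these assumptions 34 successes out of a hypothetical 40 would count as evidence against the claim, while 37 out of 40 would not.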

Original abstract

Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows that CVE-Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert-level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats including AI-tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security. Fine-tuned Qwen3-32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open-source CVE-Factory, LiveCVEBench, Abacus-cve (fine-tuned model), training dataset, and leaderboard. All resources are available at https://github.com/livecvebench/CVE-Factory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CVE-Factory automates turning CVE metadata into executable security tasks and claims near-expert fidelity, but the validation details are too thin to fully trust the scaling story yet.


The main takeaway is that the authors built a multi-agent pipeline to convert sparse CVE metadata into fully executable agentic vulnerability tasks, reporting 95% solution correctness and 96% environment fidelity against human reproductions. They then used it to create LiveCVEBench with 190 tasks across 14 languages and 153 repositories plus over 1,000 training environments, and showed that fine-tuning Qwen3-32B lifts performance from 5.3% to 35.8% on the benchmark while beating Claude 4.5 Sonnet and generalizing somewhat to Terminal Bench. They open-source the factory, benchmark, data, model, and leaderboard, which is the practical move here.

The core idea addresses a genuine bottleneck: manual CVE reproduction is slow and lags behind new threats, so automating it at scale could help train better code security agents. The open resources make the work immediately usable for others to test or extend. The fine-tuning gains are concrete numbers that show downstream value if the tasks hold up.

The soft spot is the validation. The high agreement figures are the load-bearing claim for calling the output expert-level, yet the abstract gives no sample size, selection method, blinding, or precise metric definitions. Without those, it is hard to rule out that the pipeline was tested mainly on easier or more automatable CVEs or that shared starting metadata created superficial agreement. The 66.2% verified success on recent vulnerabilities also lacks enough context on baselines or controls. If the full paper supplies the missing protocol details and shows the sample was representative, the central argument strengthens; right now it rests on under-specified evidence.

This paper is for people building or evaluating AI agents for code security and vulnerability work. Anyone needing more realistic training data or a fresh benchmark in this area can get value from the released artifacts.

It deserves a serious referee because the problem is real, the engineering contribution is clear, and the open-sourcing lowers the barrier to checking the claims, even if the evaluation needs tightening.

Referee Report

1 major / 1 minor

Summary. The paper introduces CVE-Factory, a multi-agent framework that automatically transforms CVE metadata into fully executable agentic tasks for code security vulnerabilities. It claims expert-level quality based on cross-validation against human expert reproductions, reporting 95% solution correctness and 96% environment fidelity. The framework is used to construct LiveCVEBench (190 tasks across 14 languages and 153 repositories) and over 1,000 training environments. Fine-tuning Qwen3-32B on the generated data improves performance from 5.3% to 35.8% on LiveCVEBench (surpassing Claude 4.5 Sonnet) with generalization to Terminal Bench (12.5% to 31.3%). All resources are open-sourced.

Significance. If the cross-validation results hold under rigorous scrutiny, the work would be significant for scaling high-quality, executable vulnerability tasks in AI agent security research, addressing the cost and scalability issues of manual reproduction. The open-sourcing of the framework, LiveCVEBench, training dataset, fine-tuned model, and leaderboard is a clear strength that supports reproducibility. The reported downstream gains and generalization provide evidence of practical utility for training security-capable agents on realistic, emerging threats including AI-tooling vulnerabilities.

major comments (1)

[Abstract] The central claim that CVE-Factory achieves expert-level quality rests on cross-validation results of 95% solution correctness and 96% environment fidelity versus human expert reproductions. However, the manuscript provides no details whatsoever on the evaluation protocol, including sample size, CVE selection criteria (e.g., random vs. convenience sampling), blinding of evaluators, precise operational definitions of the two metrics, or any statistical controls for inter-rater agreement or variance. This information is required to evaluate whether the high agreement rules out systematic artifacts from the multi-agent pipeline or selection bias toward easily automatable CVEs.
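Because the sample size is not reported, the uncertainty around the 95% figure cannot be computed from the text. As a rough illustration only, the sketch below applies the standard Wilson score interval to an observed proportion of 0.95 at several hypothetical sample sizes; the values of n are placeholders, not figures from the paper, and `wilson_interval` is an illustrative helper name.

```python
from math import sqrt


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    p_hat: observed proportion, n: sample size,
    z: normal quantile (1.96 gives an approximate 95% interval).
    """
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical sample sizes; the paper's actual n is not given in the source.
for n in (20, 50, 100, 500):
    low, high = wilson_interval(0.95, n)
    print(f"n={n:>3}: approx. 95% interval [{low:.3f}, {high:.3f}]")
```

With a hypothetical 20 validated CVEs the interval runs from roughly 0.76 to 0.99, so the same headline number supports very different conclusions depending on how many reproductions were compared.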

minor comments (1)

[Abstract] The 66.2% verified success rate on the latest realistic vulnerabilities is reported without defining "verified success" or describing the evaluation setup, task selection, or success criteria used for this figure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We appreciate the emphasis on transparency in our evaluation methods. Below, we provide a point-by-point response to the major comment and outline the revisions we will make.


Referee: [Abstract] The central claim that CVE-Factory achieves expert-level quality rests on cross-validation results of 95% solution correctness and 96% environment fidelity versus human expert reproductions. However, the manuscript provides no details whatsoever on the evaluation protocol, including sample size, CVE selection criteria (e.g., random vs. convenience sampling), blinding of evaluators, precise operational definitions of the two metrics, or any statistical controls for inter-rater agreement or variance. This information is required to evaluate whether the high agreement rules out systematic artifacts from the multi-agent pipeline or selection bias toward easily automatable CVEs.

Authors: We agree that the manuscript would benefit from more explicit details on the cross-validation protocol to allow readers to fully assess the reliability of our expert-level quality claims. The current version focuses on the results in the abstract and main text but does not elaborate on the protocol in sufficient depth. In the revised manuscript, we will expand the Methods section (or add an Appendix) to include a full description of the evaluation protocol. This will cover the sample size used for the cross-validation, the CVE selection criteria (specifying whether random or convenience sampling was used), the blinding of evaluators, the precise operational definitions of solution correctness and environment fidelity, and statistical controls such as inter-rater agreement measures. These additions will help address concerns regarding potential systematic artifacts from the multi-agent pipeline or selection bias. We are preparing the specific details from our experimental logs for inclusion in the revision.

Revision: yes
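The inter-rater agreement measure promised in the response is not specified. Cohen's kappa is one common choice, and the sketch below shows how it could be computed for two evaluators judging the same tasks; the labels and verdict lists are invented for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on ten generated tasks (not data from the paper).
expert_1 = ["pass"] * 9 + ["fail"]
expert_2 = ["pass"] * 8 + ["fail"] * 2
print(f"kappa = {cohens_kappa(expert_1, expert_2):.3f}")
```

In this made-up example the raters agree on 9 of 10 tasks, yet kappa is only about 0.62, because most verdicts are "pass" and much of the raw agreement is expected by chance. That is why a chance-corrected statistic matters when nearly all tasks are judged correct.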

Circularity Check

0 steps flagged

No significant circularity; external human benchmarks anchor core claims


The paper's derivation chain centers on a multi-agent pipeline that transforms CVE metadata into executable tasks, with quality asserted via cross-validation against independent human expert reproductions (95% solution correctness, 96% environment fidelity). This external benchmark prevents reduction of the expert-level claim to the pipeline's own outputs by construction. LiveCVEBench and the 1,000+ training environments are generated by the same framework, yet the reported performance gains and comparisons (e.g., fine-tuned model surpassing Claude) are presented as empirical results on those constructed artifacts rather than tautological redefinitions of the inputs. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the chain remains self-contained against the stated human reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework is presented as a new engineering contribution without detailing internal hyperparameters or unstated modeling assumptions.

