CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability
Pith reviewed 2026-05-21 14:38 UTC · model grok-4.3
The pith
The CVE-Factory multi-agent framework converts CVE metadata into executable security tasks, with 95% solution correctness against expert reproductions
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CVE-Factory is the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows 95% solution correctness and 96% environment fidelity. On the latest realistic vulnerabilities, the system reaches 66.2% verified success. This automation produces LiveCVEBench, a continuously updated set of 190 tasks spanning 14 languages and 153 repositories, together with over 1,000 synthesized training environments. Fine-tuning Qwen3-32B on the resulting data lifts accuracy from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, and the gains generalize to Terminal Bench (12.5% to 31.3%).
What carries the argument
The multi-agent pipeline that converts CVE metadata into fully executable environments and agentic tasks
If this is right
- LiveCVEBench supplies a continuously updated collection of 190 executable tasks across 14 languages and 153 repositories that includes emerging AI-tooling vulnerabilities.
- Over 1,000 executable training environments are produced at scale for the first time in code security.
- Fine-tuned Qwen3-32B reaches 35.8% on LiveCVEBench, exceeding Claude 4.5 Sonnet.
- Performance gains transfer to Terminal Bench, rising from 12.5% to 31.3%.
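To put the reported percentages in perspective, the short sketch below computes the absolute and relative gains and the number of solved tasks each LiveCVEBench figure would imply. The helper names are illustrative, and the task counts assume that every one of the 190 LiveCVEBench tasks was scored, which the summary does not state explicitly.

```python
def gain(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative multiplier)."""
    return after_pct - before_pct, after_pct / before_pct


def implied_solved(pct: float, n_tasks: int) -> int:
    """Nearest whole number of solved tasks consistent with a reported percentage."""
    return round(pct / 100 * n_tasks)


# Percentages as reported in the abstract.
live_abs, live_rel = gain(5.3, 35.8)
tb_abs, tb_rel = gain(12.5, 31.3)
print(f"LiveCVEBench:   +{live_abs:.1f} points, {live_rel:.1f}x")
print(f"Terminal Bench: +{tb_abs:.1f} points, {tb_rel:.1f}x")

# Assumption: all 190 LiveCVEBench tasks count toward the reported accuracy.
N_LIVECVEBENCH = 190
before_solved = implied_solved(5.3, N_LIVECVEBENCH)
after_solved = implied_solved(35.8, N_LIVECVEBENCH)
print(f"Implied solved tasks: {before_solved} -> {after_solved} of {N_LIVECVEBENCH}")
```

Under that assumption the two accuracies correspond to roughly 10 and 68 solved tasks, which is a useful reminder that the baseline figure rests on a small number of successes.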
Where Pith is reading between the lines
- The same pipeline could ingest newly disclosed CVEs in near real time to keep benchmarks and training sets current without repeated manual effort.
- Focusing generation on AI-tooling vulnerabilities creates a natural testbed for studying security risks introduced by code-generation agents themselves.
- Open release of the framework, benchmark, and dataset allows direct extension to additional languages or integration with other agent evaluation suites.
Load-bearing premise
The automated multi-agent process produces tasks whose difficulty, realism, and lack of bias match those created manually by security experts without systematic artifacts.
What would settle it
Independent human experts reproduce a random sample of the generated tasks and achieve solution success rates significantly below 95% or report consistent differences in setup fidelity.
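One way such a replication could be judged "significantly below 95%" is sketched here, using only the Python standard library: a Wilson score interval for the observed success rate and an exact one-sided binomial p-value against the claimed 95%. The sample size and success count are hypothetical placeholders, not figures from the paper, and the choice of test is an assumption rather than anything the authors or reviewers specify.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95% coverage)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def binom_lower_tail(successes: int, n: int, p0: float) -> float:
    """Exact P(X <= successes) for X ~ Binomial(n, p0): a one-sided p-value."""
    return sum(
        math.comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(successes + 1)
    )


# Hypothetical replication: experts reproduce 50 sampled tasks and 42 succeed.
n_sampled, n_solved = 50, 42
claimed_rate = 0.95

low, high = wilson_interval(n_solved, n_sampled)
p_value = binom_lower_tail(n_solved, n_sampled, claimed_rate)

print(f"Observed rate: {n_solved / n_sampled:.2f}")
print(f"Wilson 95% interval: ({low:.3f}, {high:.3f})")
print(f"One-sided p-value against {claimed_rate:.0%}: {p_value:.4f}")
```

In this made-up scenario the interval lies entirely below 0.95 and the p-value is small, so the replication would count against the claim. With a much smaller sample the same observed rate could easily remain compatible with 95%, which is why the sample size matters.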
Original abstract
Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows that CVE-Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert-level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats including AI-tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security. Fine-tuned Qwen3-32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open-source CVE-Factory, LiveCVEBench, Abacus-cve (fine-tuned model), training dataset, and leaderboard. All resources are available at https://github.com/livecvebench/CVE-Factory .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CVE-Factory, a multi-agent framework that automatically transforms CVE metadata into fully executable agentic tasks for code security vulnerabilities. It claims expert-level quality based on cross-validation against human expert reproductions, reporting 95% solution correctness and 96% environment fidelity. The framework is used to construct LiveCVEBench (190 tasks across 14 languages and 153 repositories) and over 1,000 training environments. Fine-tuning Qwen3-32B on the generated data improves performance from 5.3% to 35.8% on LiveCVEBench (surpassing Claude 4.5 Sonnet) with generalization to Terminal Bench (12.5% to 31.3%). All resources are open-sourced.
Significance. If the cross-validation results hold under rigorous scrutiny, the work would be significant for scaling high-quality, executable vulnerability tasks in AI agent security research, addressing the cost and scalability issues of manual reproduction. The open-sourcing of the framework, LiveCVEBench, training dataset, fine-tuned model, and leaderboard is a clear strength that supports reproducibility. The reported downstream gains and generalization provide evidence of practical utility for training security-capable agents on realistic, emerging threats including AI-tooling vulnerabilities.
Major comments (1)
- [Abstract] The central claim that CVE-Factory achieves expert-level quality rests on cross-validation results of 95% solution correctness and 96% environment fidelity versus human expert reproductions. However, the manuscript provides no details on the evaluation protocol, including sample size, CVE selection criteria (e.g., random vs. convenience sampling), blinding of evaluators, precise operational definitions of the two metrics, or any statistical controls for inter-rater agreement or variance. This information is required to evaluate whether the high agreement rules out systematic artifacts from the multi-agent pipeline or selection bias toward easily automatable CVEs.
Minor comments (1)
- [Abstract] The 66.2% verified success rate on the latest realistic vulnerabilities is reported without defining "verified success" or describing the evaluation setup, task selection, or success criteria used for this figure.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We appreciate the emphasis on transparency in our evaluation methods. Below, we provide a point-by-point response to the major comment and outline the revisions we will make.
- Referee: [Abstract] The central claim that CVE-Factory achieves expert-level quality rests on cross-validation results of 95% solution correctness and 96% environment fidelity versus human expert reproductions. However, the manuscript provides no details on the evaluation protocol, including sample size, CVE selection criteria (e.g., random vs. convenience sampling), blinding of evaluators, precise operational definitions of the two metrics, or any statistical controls for inter-rater agreement or variance. This information is required to evaluate whether the high agreement rules out systematic artifacts from the multi-agent pipeline or selection bias toward easily automatable CVEs.
  Authors: We agree that the manuscript would benefit from more explicit details on the cross-validation protocol to allow readers to fully assess the reliability of our expert-level quality claims. The current version focuses on the results in the abstract and main text but does not elaborate on the protocol in sufficient depth. In the revised manuscript, we will expand the Methods section (or add an Appendix) to include a full description of the evaluation protocol. This will cover the sample size used for the cross-validation, the CVE selection criteria (specifying whether random or convenience sampling was used), the blinding of evaluators, the precise operational definitions of solution correctness and environment fidelity, and statistical controls such as inter-rater agreement measures. These additions will help address concerns regarding potential systematic artifacts from the multi-agent pipeline or selection bias. We are preparing the specific details from our experimental logs for inclusion in the revision.
  Revision: yes
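As an illustration of the kind of inter-rater agreement measure discussed in this exchange, the sketch below computes Cohen's kappa for two evaluators labelling the same tasks. Neither the referee nor the authors name a specific statistic, so the choice of kappa is an assumption, and the labels are invented purely for demonstration.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1.0:
        return float("nan")  # Only one label was ever used; kappa is undefined.
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two evaluators on ten generated tasks.
evaluator_1 = ["pass"] * 8 + ["fail"] * 2
evaluator_2 = ["pass"] * 7 + ["fail", "pass", "fail"]

raw_agreement = sum(a == b for a, b in zip(evaluator_1, evaluator_2)) / len(evaluator_1)
kappa = cohens_kappa(evaluator_1, evaluator_2)
print(f"Raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.3f}")
```

In this toy case the evaluators agree on 80% of tasks, yet kappa is only 0.375 because most tasks pass and much of the agreement is expected by chance. This is why a high raw agreement figure alone does not establish reliability when one label dominates.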
Circularity Check
No significant circularity; external human benchmarks anchor core claims
The paper's derivation chain centers on a multi-agent pipeline that transforms CVE metadata into executable tasks, with quality asserted via cross-validation against independent human expert reproductions (95% solution correctness, 96% environment fidelity). This external benchmark prevents reduction of the expert-level claim to the pipeline's own outputs by construction. LiveCVEBench and the 1,000+ training environments are generated by the same framework, yet the reported performance gains and comparisons (e.g., fine-tuned model surpassing Claude) are presented as empirical results on those constructed artifacts rather than tautological redefinitions of the inputs. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the chain remains self-contained against the stated human reference.