TitanCA: Lessons from Orchestrating LLM Agents to Discover 100+ CVEs
Pith reviewed 2026-05-10 04:44 UTC · model grok-4.3
The pith
An orchestrated team of LLM agents discovers 203 confirmed zero-day vulnerabilities in open-source code, yielding 118 CVEs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TitanCA orchestrates LLM-powered agents through a sequence of matching, filtering, inspection, and adaptation modules to scan open-source software, resulting in the discovery of 203 confirmed zero-day vulnerabilities and the assignment of 118 CVEs.
What carries the argument
The four-module architecture of matching, filtering, inspection, and adaptation that coordinates multiple LLM agents for targeted vulnerability detection.
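The abstract names the four modules but not their interfaces. To make the staged hand-off concrete, here is a minimal sketch of how such an orchestration could be wired together, assuming hypothetical module signatures, prompts, and an ask_llm helper; none of this is TitanCA's actual design.

# Minimal sketch of a matching -> filtering -> inspection -> adaptation
# pipeline. Every name, prompt, and the ask_llm helper below are
# hypothetical illustrations, not TitanCA's published interfaces.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    file: str
    snippet: str
    notes: list[str] = field(default_factory=list)

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM backend."""
    raise NotImplementedError

def matching(repo_files: list[tuple[str, str]]) -> list[Candidate]:
    """Flag code locations that resemble known vulnerable patterns."""
    hits = []
    for path, code in repo_files:
        verdict = ask_llm(f"Does this code match a known vulnerable pattern? Answer yes or no.\n{code}")
        if verdict.strip().lower().startswith("yes"):
            hits.append(Candidate(file=path, snippet=code))
    return hits

def filtering(cands: list[Candidate]) -> list[Candidate]:
    """Cheaply discard likely false positives before deep inspection."""
    return [c for c in cands
            if ask_llm(f"Is this a plausible vulnerability? Answer yes or no.\n{c.snippet}")
               .strip().lower().startswith("yes")]

def inspection(cands: list[Candidate]) -> list[Candidate]:
    """Per-candidate deep analysis by a dedicated agent."""
    for c in cands:
        c.notes.append(ask_llm(f"Describe the exploit path, if any:\n{c.snippet}"))
    return cands

def adaptation(cands: list[Candidate], context: str) -> list[Candidate]:
    """Re-validate findings against project-specific context."""
    return [c for c in cands
            if ask_llm(f"Given {context}, is this finding still valid? Answer yes or no.\n{c.snippet}")
               .strip().lower().startswith("yes")]

def run_pipeline(repo_files, context):
    return adaptation(inspection(filtering(matching(repo_files))), context)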
If this is right
- Traditional static analysis can be supplemented by LLM orchestration to lower false-positive rates in vulnerability scanning.
- Open-source projects gain a practical method for surfacing previously undetected security issues at the scale of hundreds of confirmed cases.
- Deployment experience highlights concrete design choices for the filtering and adaptation steps that improve reliability.
- The pipeline produces actionable results that translate directly into CVE assignments and fixes.
Where Pith is reading between the lines
- Similar agent orchestration could extend to other quality assurance tasks such as detecting logic bugs or performance issues.
- The approach may lower the barrier for smaller teams to perform thorough security audits without large manual review efforts.
- Lessons on module coordination could guide the construction of multi-agent systems for other domains that require sequential analysis of complex artifacts.
Load-bearing premise
The flagged issues are genuine zero-day vulnerabilities that independent verification can confirm with low false-positive rates, and the orchestration pattern applies beyond the specific projects tested.
What would settle it
Independent security researchers attempting to reproduce the 203 reported vulnerabilities or running the same four-module pipeline on a fresh set of open-source projects and measuring the number of new verified CVEs.
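The settling experiment reduces to a measurement loop: run the same pipeline on fresh projects and count findings that survive independent verification. The sketch below assumes a run_pipeline like the hypothetical one above and an independently_verified check supplied by outside researchers; both are placeholders.

# Sketch of the proposed replication: fresh projects in, verified yield out.
# run_pipeline and independently_verified are hypothetical stand-ins for
# the deployed system and an external verification step.

def reproduction_study(projects, run_pipeline, independently_verified):
    reported = verified = 0
    for project in projects:
        findings = run_pipeline(project.files, project.context)
        reported += len(findings)
        verified += sum(1 for f in findings if independently_verified(f))
    # The verified count and the verified/reported ratio are the
    # quantities that would settle the core claim.
    return {"reported": reported,
            "verified": verified,
            "precision": verified / reported if reported else 0.0}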
Original abstract
Software vulnerabilities remain one of the most persistent threats to modern digital infrastructure. While static application security testing (SAST) tools have long served as the first line of defense, they suffer from high false-positive rates. This article presents TitanCA, a collaborative project between Singapore Management University and GovTech Singapore that orchestrates multiple large language model (LLM)-powered agents into a unified vulnerability discovery pipeline. Applied to open-source software, TitanCA has discovered 203 confirmed zero-day vulnerabilities and yielded 118 CVEs. We describe the four-module architecture, i.e., matching, filtering, inspection, and adaptation, and share key lessons from building and deploying an LLM-based vulnerability discovery solution in practice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents TitanCA, a system orchestrating multiple LLM-powered agents into a four-module pipeline (matching, filtering, inspection, and adaptation) for vulnerability discovery in open-source software. It reports discovering 203 confirmed zero-day vulnerabilities that yielded 118 CVEs and shares practical lessons from building and deploying the system.
Significance. If the reported vulnerabilities are independently validated with demonstrably low false-positive rates, the work would be significant for showing that multi-agent LLM orchestration can surface real, previously unknown security issues in production OSS at scale, providing a practical complement to traditional SAST tools. The emphasis on deployment lessons adds immediate value for the security-engineering community.
Major comments (2)
- [Abstract and Results] The central claim (203 confirmed zero-days and 118 CVEs) is presented in the abstract and results without any quantitative breakdown of total candidates generated by the pipeline, acceptance/rejection rates at each module, or the explicit confirmation workflow (e.g., maintainer review statistics, CVE assignment timeline, or cross-checks against NVD/other databases). This information is load-bearing for establishing that the outputs are true zero-days rather than unconfirmed reports.
- [Evaluation / Results] No precision, recall, or false-positive metrics are supplied for the individual modules or the end-to-end system, nor is there a comparison against baseline SAST tools or alternative LLM prompting strategies. Without these, the contribution of the four-module orchestration cannot be isolated from the headline numbers.
Minor comments (2)
- [Title and Abstract] The title's '100+ CVEs' is consistent with the abstract's 118 but less precise; consider stating the exact figure and ensuring numerical consistency across title, abstract, and body.
- [Architecture] The description of the four modules would benefit from a single summary table listing input/output interfaces and key LLM prompts used in each stage.
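For concreteness, the summary table the referee requests might be shaped like the schematic below; the cell contents are illustrative guesses, not details from the paper.

Module       Input                    Output                   Illustrative agent role
Matching     repository files         candidate locations      pattern recognition
Filtering    candidate locations      plausible candidates     false-positive triage
Inspection   plausible candidates     analyzed findings        exploit-path reasoning
Adaptation   analyzed findings        validated reports        project-context validation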
Simulated Author's Rebuttal
Thank you for the referee's insightful comments. We appreciate the opportunity to clarify and strengthen our manuscript. Below, we provide point-by-point responses to the major comments.
Point-by-point responses
- Referee: [Abstract and Results] The central claim (203 confirmed zero-days and 118 CVEs) is presented in the abstract and results without any quantitative breakdown of total candidates generated by the pipeline, acceptance/rejection rates at each module, or the explicit confirmation workflow (e.g., maintainer review statistics, CVE assignment timeline, or cross-checks against NVD/other databases). This information is load-bearing for establishing that the outputs are true zero-days rather than unconfirmed reports.
Authors: We agree with this observation and will revise the manuscript to include the requested quantitative details. Specifically, we will add a table and accompanying text in the Results section detailing the number of candidates entering and exiting each of the four modules, along with acceptance rates. Additionally, we will describe the confirmation workflow, including statistics on maintainer reviews, the timeline for CVE assignments, and how we verified against NVD and other sources to confirm these are zero-days. This information is available from our internal logs and will be summarized appropriately.
Revision: yes
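The promised funnel statistics are easy to derive once per-stage logs exist. A minimal sketch under an assumed log schema, one record per candidate per stage with an accepted flag; the real TitanCA logging format is not public.

# Per-module acceptance rates from pipeline logs. The record schema
# {"stage": ..., "candidate_id": ..., "accepted": ...} is hypothetical.
from collections import defaultdict

def funnel_stats(log_records):
    seen = defaultdict(int)
    accepted = defaultdict(int)
    for rec in log_records:
        seen[rec["stage"]] += 1
        accepted[rec["stage"]] += int(rec["accepted"])
    return {stage: {"in": seen[stage],
                    "out": accepted[stage],
                    "acceptance_rate": accepted[stage] / seen[stage]}
            for stage in ("matching", "filtering", "inspection", "adaptation")
            if seen[stage] > 0}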
- Referee: [Evaluation / Results] No precision, recall, or false-positive metrics are supplied for the individual modules or the end-to-end system, nor is there a comparison against baseline SAST tools or alternative LLM prompting strategies. Without these, the contribution of the four-module orchestration cannot be isolated from the headline numbers.
Authors: The paper's primary contribution lies in the practical lessons learned from orchestrating LLM agents in a real deployment scenario, rather than in providing a benchmark-style evaluation. As such, we did not compute precision/recall metrics or run systematic comparisons, which would have required a different experimental setup. We will add a paragraph in the Discussion section explaining this focus and noting that the real-world validation through 118 CVEs provides strong evidence of effectiveness. We can partially address this by including high-level qualitative comparisons to SAST tools based on observed false-positive rates in practice, but full quantitative baselines are beyond the scope of this lessons-oriented work.
Revision: partial
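Even without module-level metrics, report-level precision is one division away once the total number of reports filed is disclosed; the sketch below keeps that denominator symbolic because the paper does not state it.

# Report-level precision: confirmed zero-days over total reports filed.
# The paper gives the numerator (203 confirmed) but not the denominator,
# so total_reports_filed stays a symbolic input here.

def report_precision(confirmed: int, total_reports_filed: int) -> float:
    if total_reports_filed <= 0:
        raise ValueError("total_reports_filed must be positive")
    return confirmed / total_reports_filed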
Circularity Check
No circularity: empirical systems report with no derivations or self-referential predictions
Full rationale
The paper describes a four-module LLM agent pipeline (matching, filtering, inspection, adaptation) applied to open-source software and reports empirical outcomes: 203 confirmed zero-day vulnerabilities yielding 118 CVEs. No equations, fitted parameters, mathematical predictions, or derivation chains appear. The claims rest on external CVE assignment and maintainer validation, processes independent of the system's internal construction. No self-citations form load-bearing premises, no uniqueness theorems are invoked, and no ansatzes or renamings reduce results to inputs. This is a standard, non-circular empirical systems paper.