
arxiv: 2605.20051 · v1 · pith:SYHQRVCSnew · submitted 2026-05-19 · 💻 cs.CR

Hunting Vulnerability Variants in AI Infra: Measurement and Reference-Driven Detection

Pith reviewed 2026-05-20 03:46 UTC · model grok-4.3

classification 💻 cs.CR
keywords AI infrastructure · vulnerability variants · reference-driven detection · multi-agent framework · security measurement · GitHub repositories · vulnerability disclosure

The pith

AI infrastructure projects share recurrent vulnerable patterns across similar designs, which a reference-driven framework can detect from known disclosures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how AI infrastructure repositories reuse overlapping functionality for model training and deployment, leading to variants of the same vulnerabilities appearing in multiple projects. Analysis of 688 repositories and 251 disclosed vulnerabilities establishes that these shared workflows create a basis for cross-repository recurrence. Building on the measurement, the authors introduce INFRASCOPE, a multi-agent system that extracts transferable semantics from known vulnerability cases and applies them to locate and validate variants in new codebases. Evaluation across 20 real-world repositories surfaced over 20 vulnerabilities, including 11 acknowledged cases and 4 assigned CVEs.

Core claim

AI infra projects frequently implement related model-centric workflows, so vulnerabilities disclosed in one repository recur as variants in others with similar structure. INFRASCOPE extracts vulnerability semantics from public disclosures and deploys them through a reference-driven multi-agent process to scan fresh repositories for matching patterns and confirm their validity.

What carries the argument

INFRASCOPE, a reference-driven multi-agent framework that extracts transferable vulnerability semantics from known disclosures to locate and validate variants in new repositories.
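
The three-stage flow described in the Figure 4 caption (semantic agent, inspection agent, verification agent) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not INFRASCOPE's implementation: every name here (`VulnSemantics`, `extract_semantics`, `inspect_repo`, `verify`, `hunt`) is hypothetical, and plain substring matching stands in for the LLM-driven agents, static code facts, and sandbox validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VulnSemantics:
    """Hypothetical transferable description of a known vulnerability."""
    ref_id: str
    sink: str   # dangerous operation to look for, e.g. "pickle.loads("
    guard: str  # token whose presence suggests a mitigation


@dataclass(frozen=True)
class Candidate:
    ref_id: str
    path: str
    line_no: int
    line: str


def extract_semantics(ref_id: str, sink: str, guard: str) -> VulnSemantics:
    # Stage 1 (semantic agent): a real system would derive this from the
    # disclosure; in this sketch the caller supplies it directly.
    return VulnSemantics(ref_id, sink, guard)


def inspect_repo(sem: VulnSemantics, repo: dict, budget: int) -> list:
    # Stage 2 (inspection agent): scan at most `budget` files for the sink.
    # The file-count budget is an assumed stand-in for a context budget.
    found = []
    for path in sorted(repo)[:budget]:
        for no, line in enumerate(repo[path].splitlines(), start=1):
            if sem.sink in line:
                found.append(Candidate(sem.ref_id, path, no, line.strip()))
    return found


def verify(sem: VulnSemantics, cand: Candidate, repo: dict) -> bool:
    # Stage 3 (verification agent): keep a candidate only if its file shows
    # no sign of the guard. Assumed stand-in for real validation.
    return sem.guard not in repo[cand.path]


def hunt(sem: VulnSemantics, repo: dict, budget: int = 10) -> list:
    return [c for c in inspect_repo(sem, repo, budget) if verify(sem, c, repo)]


if __name__ == "__main__":
    # Toy in-memory "repository": path -> file contents (invented example).
    repo = {
        "server/load.py": "import pickle\n\ndef load(b):\n    return pickle.loads(b)\n",
        "server/safe.py": "def load(b):\n    check_signature(b)\n    return pickle.loads(b)\n",
    }
    sem = extract_semantics("REF-1", "pickle.loads(", "check_signature(")
    for c in hunt(sem, repo):
        print(f"{c.path}:{c.line_no}: {c.line}")
```

Separating inspection from verification keeps the cheap, high-recall search apart from the stricter confirmation step, which is the same division of labour the caption attributes to the inspection and verification agents.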

If this is right

  • Shared design patterns in AI infra create concrete opportunities for vulnerability variants to spread across projects.
  • Reference cases from public disclosures supply enough semantic information to guide automated searches in separate but structurally related repositories.
  • The detection process can surface previously unknown issues that later receive official acknowledgments and CVE assignments.
  • Measurement across hundreds of repositories confirms that overlapping functionality is common enough to justify systematic variant hunting.
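
One simple way to quantify "overlapping functionality" across a cohort is to record binary features per repository, count how many repositories contain each feature, and compare repositories pairwise. The sketch below does this with Jaccard similarity. It is an assumption-laden illustration: the repository names and feature names are invented, and this summary does not describe the paper's actual feature set or similarity measure.

```python
from itertools import combinations


def prevalence(repos: dict) -> dict:
    """Count how many repositories contain each feature."""
    counts = {}
    for features in repos.values():
        for feature in features:
            counts[feature] = counts.get(feature, 0) + 1
    return counts


def cohort_share(repos: dict) -> dict:
    """Prevalence normalized by the number of repositories in the cohort."""
    n = len(repos)
    return {f: c / n for f, c in prevalence(repos).items()} if n else {}


def jaccard(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(repos: dict) -> dict:
    """Jaccard similarity for every unordered pair of repositories."""
    return {
        (x, y): jaccard(repos[x], repos[y])
        for x, y in combinations(sorted(repos), 2)
    }


if __name__ == "__main__":
    # Invented cohort: repository name -> set of hypothetical feature labels.
    cohort = {
        "repo_a": {"model_loading", "http_api"},
        "repo_b": {"model_loading"},
        "repo_c": {"http_api", "plugin_exec"},
    }
    print(prevalence(cohort))
    print(pairwise_overlap(cohort))
```

High pairwise overlap between two repositories would mark them as plausible places to look for variants of each other's disclosed vulnerabilities.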

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Security auditing practices for AI systems could shift toward maintaining libraries of reference vulnerabilities rather than relying solely on per-project scans.
  • Similar recurrence patterns may exist in other domains with high code reuse, such as cloud orchestration or data pipeline tools.
  • Integrating the semantic extraction step with existing static analysis engines could reduce the manual validation burden for discovered variants.

Load-bearing premise

Vulnerability semantics extracted from known disclosures transfer reliably to new repositories without excessive false positives or loss of important context-specific details.

What would settle it

If manual examination of the 20 evaluated repositories shows that the reported vulnerabilities are not genuine variants of the referenced disclosures or if the method produces mostly invalid findings on additional repositories, the transferability claim would not hold.

Figures

Figures reproduced from arXiv: 2605.20051 by Dong Zhang, Hao Chen, Huaien Zhang, Keke Lian, Shaofeng Li, Shoufeng Zhang, Tian Dong, Yanjun Chen, Yunlong Lyu.

Figure 1: AI infra measurement summary. (a) monthly project creation by capability family since 2023, (b) primary-language …
Figure 2: Similar vulnerability patterns in AI infra
Figure 3: Cross-repository motivating example for vulnerability variants in AI infra found by I…
Figure 4: Overview of INFRASCOPE. The semantic agent derives vulnerability and repository semantics. Then, the inspection agent uses these semantics to prioritize target inspection under memory and context budgets. Finally, the verification agent confirms candidates with static code facts and sandbox validation before reporting findings.
Figure 5: Similarity assessment between repository semantics. The left panel compares generated semantics with human …
Figure 6: SQL injection variants between two modules within …
Abstract

AI infra has become a shared execution layer for model training, deployment, and agent orchestration. Because many projects reimplement similar model-centric workflows, a vulnerability disclosed in one repository can recur as a variant in another repository with a related design. Yet the prevalence and detectability of these variants remain poorly understood. This paper presents a measurement study of vulnerability variants in AI infra. Analyzing 688 GitHub repositories and 251 publicly disclosed vulnerabilities, we find that AI infra projects frequently share overlapping functionality and recurrent vulnerable patterns, creating a concrete basis for cross-repository variants. Building on this finding, we study how to automatically identify such variants from known disclosures. We propose INFRASCOPE, a reference-driven multi-agent framework that extracts transferable vulnerability semantics from known cases and uses them to locate and validate variants in new repositories. Evaluating INFRASCOPE on 20 real-world AI infra repositories, we uncover over 20 vulnerabilities, including 11 acknowledged cases and 4 cases that have been assigned CVEs so far.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports a measurement study analyzing 688 GitHub repositories and 251 publicly disclosed vulnerabilities in AI infrastructure projects, identifying frequent overlapping functionality and recurrent vulnerable patterns that enable cross-repository variants. It proposes INFRASCOPE, a reference-driven multi-agent framework that extracts transferable vulnerability semantics from known disclosures and applies them to locate and validate variants in new repositories. Evaluation on 20 real-world AI infra repositories is reported to uncover over 20 vulnerabilities, including 11 acknowledged cases and 4 assigned CVEs.

Significance. If the evaluation holds, the work provides empirical evidence for the prevalence of vulnerability variants in AI infra and demonstrates a practical reference-driven approach to detection that has already yielded acknowledged issues and CVEs. The measurement on a large set of repositories and use of public disclosures as external references are strengths that ground the claims in observable data rather than self-referential fitting.

major comments (2)
  1. [Evaluation of INFRASCOPE] Evaluation section (description of INFRASCOPE results on 20 repositories): the claim of uncovering over 20 vulnerabilities (11 acknowledged, 4 CVE-assigned) lacks any reported count of total candidates examined, precision/recall figures, or explicit validation criteria and process (manual review, PoC execution, static analysis, or maintainer confirmation). This information is load-bearing for assessing whether semantic extraction succeeds without excessive false positives or context loss.
  2. [Abstract and INFRASCOPE description] Abstract and framework description: the assumption that vulnerability semantics extracted from known disclosures transfer reliably to new repositories is presented without quantitative evidence on false-positive rates or context-specific failure modes, which directly affects the central claim that the framework locates and validates variants.
minor comments (2)
  1. [Measurement study description] The selection criteria for the 688 repositories and 251 vulnerabilities are not detailed in the provided abstract; adding explicit inclusion/exclusion rules would improve reproducibility.
  2. [Framework overview] Consider clarifying the multi-agent architecture of INFRASCOPE with a high-level diagram or pseudocode to make the reference-driven extraction process easier to follow.
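
The figures requested in the first major comment are simple to compute once every candidate has a validation outcome. The sketch below is illustrative only: `ValidationTally` and its fields are hypothetical names, and the numbers are placeholders, not values from the paper. Recall is only meaningful against a benchmark with a known number of variants, which is rarely available for previously unknown vulnerabilities, and the sketch assumes every confirmed finding is one of those known variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationTally:
    candidates: int      # findings the tool reported
    confirmed: int       # candidates that survived validation
    known_variants: int  # ground-truth variants in a benchmark, if any

    def precision(self) -> float:
        # Share of reported candidates that turned out to be real.
        return self.confirmed / self.candidates if self.candidates else 0.0

    def recall(self) -> float:
        # Share of known variants that the tool found (assumes every
        # confirmed finding is one of the known variants).
        return self.confirmed / self.known_variants if self.known_variants else 0.0

    def false_discovery_rate(self) -> float:
        # Share of reported candidates that were invalid.
        return 1.0 - self.precision() if self.candidates else 0.0


if __name__ == "__main__":
    # Placeholder numbers, NOT taken from the paper.
    tally = ValidationTally(candidates=40, confirmed=10, known_variants=20)
    print(f"precision={tally.precision():.2f} recall={tally.recall():.2f}")
```

Reporting the candidate total alongside the confirmed count is what lets a reader judge how much manual triage the reported findings required.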

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below and have revised the manuscript to strengthen the presentation of the evaluation and framework details.

Point-by-point responses
  1. Referee: [Evaluation of INFRASCOPE] Evaluation section (description of INFRASCOPE results on 20 repositories): the claim of uncovering over 20 vulnerabilities (11 acknowledged, 4 CVE-assigned) lacks any reported count of total candidates examined, precision/recall figures, or explicit validation criteria and process (manual review, PoC execution, static analysis, or maintainer confirmation). This information is load-bearing for assessing whether semantic extraction succeeds without excessive false positives or context loss.

    Authors: We agree that the evaluation requires more granular reporting to allow readers to assess false-positive rates and semantic fidelity. In the revised manuscript we have expanded the Evaluation section to report the total number of candidates examined by INFRASCOPE across the 20 repositories, the resulting precision and recall figures computed from the validation outcomes, and the explicit multi-stage validation criteria and process (initial static filtering, manual code review by the authors, PoC execution where feasible, and direct maintainer confirmation or CVE assignment). These additions directly address concerns about excessive false positives and context loss. revision: yes

  2. Referee: [Abstract and INFRASCOPE description] Abstract and framework description: the assumption that vulnerability semantics extracted from known disclosures transfer reliably to new repositories is presented without quantitative evidence on false-positive rates or context-specific failure modes, which directly affects the central claim that the framework locates and validates variants.

    Authors: The measurement study on 688 repositories and 251 disclosures supplies the empirical basis for recurrent overlapping patterns that enable transfer. To provide the requested quantitative grounding, we have revised the INFRASCOPE description and added a new subsection that reports observed false-positive rates from the 20-repository evaluation together with an analysis of context-specific failure modes (e.g., differences in dependency resolution or API surface) and how the multi-agent extraction mitigates them. This supplies concrete evidence supporting reliable transfer in the evaluated setting. revision: yes

Circularity Check

0 steps flagged

No circularity: results derive from external repositories and disclosures

Full rationale

The paper's measurement analyzes 688 GitHub repositories and 251 publicly disclosed vulnerabilities as independent external inputs, identifying overlapping functionality as an empirical observation. INFRASCOPE is then built to extract semantics from these known cases and apply them to new repositories for variant detection. Evaluation on 20 additional real-world repositories reports over 20 vulnerabilities (11 acknowledged, 4 with CVEs), grounded in external data and validation rather than any fitted parameters, self-definitions, or self-citation chains that reduce the central claims to inputs by construction. No equations, ansatzes, uniqueness theorems, or renamings of known results appear in the derivation; the chain remains self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that shared design patterns in AI infra create transferable vulnerability semantics, with the INFRASCOPE framework introduced as the primary new construct.

axioms (1)
  • [domain assumption] AI infra projects frequently share overlapping functionality and recurrent vulnerable patterns
    Invoked in the abstract as the concrete basis for cross-repository variants and the motivation for INFRASCOPE.
invented entities (1)
  • INFRASCOPE multi-agent framework (no independent evidence)
    Purpose: extracts transferable vulnerability semantics from known cases to locate and validate variants.
    Newly proposed system whose effectiveness is demonstrated only through the paper's own evaluation on 20 repositories.

pith-pipeline@v0.9.0 · 5725 in / 1335 out tokens · 36493 ms · 2026-05-20T03:46:28.273516+00:00 · methodology


