Antaeus: Hunting Repository-Level Logic Vulnerabilities via Context-Grounded LLM Reasoning

Lorenzo Cavallaro; Michele Armillotta; Nicol\`o Romandini; Rebecca Montanari

arxiv: 2607.01138 · v1 · pith:LVFRH23Snew · submitted 2026-07-01 · 💻 cs.CR

Antaeus: Hunting Repository-Level Logic Vulnerabilities via Context-Grounded LLM Reasoning

Michele Armillotta , Nicol\`o Romandini , Rebecca Montanari , Lorenzo Cavallaro This is my paper

Pith reviewed 2026-07-02 10:20 UTC · model grok-4.3

classification 💻 cs.CR

keywords logic vulnerabilitiesLLM reasoningrepository analysissecurity invariantsvulnerability detectioncontext groundingapplication security

0 comments

The pith

Antaeus detects logic vulnerabilities by grounding LLMs in repository-level context to infer implicit security invariants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Logic vulnerabilities depend on application-specific security invariants that are implicit and scattered across code, making them hard for LLMs to spot when analyzing functions in isolation. Antaeus addresses this with a pipeline that first ranks functions using lightweight repo-wide security signals, then supplies both local code and a broader view of functionality, security resources, and trust boundaries. For each function the system identifies security-sensitive sinks, derives the conditions required for safe execution, and checks whether those conditions hold locally. Candidate issues then undergo comparative validation to remove patterns that are normal across the project. On 28 repositories containing known logic vulnerabilities, this method detects and explains 15 cases while using token budgets comparable to simpler baselines.

Core claim

Antaeus follows a repository-scale pipeline combining function prioritization, context-grounded reasoning, comparative validation, and structured reporting. It ranks functions using lightweight repo-wide security signals, directs LLM analysis toward relevant code, and for each prioritized function combines local context with a repository-level view of functionality, security resources, and trust boundaries. This enables the model to identify security-sensitive sinks, derive safety conditions for safe execution, check local satisfaction of those conditions, prune findings that reflect project-wide norms, and report sinks, violated conditions, and supporting evidence.

What carries the argument

Repository-scale pipeline that supplies LLMs with both local function code and a repository-level view of functionality, security resources, and trust boundaries.

If this is right

Detects and explains 15 vulnerabilities across the 28 evaluated repositories.
Outperforms function-level and agentic models while using comparable token counts and cost.
Reduces unnecessary LLM calls and triage effort through function prioritization and comparative validation.
Produces traceable reports that list sinks, violated safety conditions, and evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structured context and validation steps could be tested on semantic bugs outside the security domain.
Prioritization and comparative filtering may prove more important than simply increasing raw context-window size.
The pipeline suggests a general template for making LLM code reasoning respect project-wide norms rather than treating every file as independent.

Load-bearing premise

Supplying repository-level context enables frontier LLMs to reliably infer implicit, application-specific security invariants that are otherwise buried among unrelated code.

What would settle it

Running the system on the same 28 repositories and obtaining fewer than 15 detections or no improvement over function-level and agentic baselines despite the added context.

Figures

Figures reproduced from arXiv: 2607.01138 by Lorenzo Cavallaro, Michele Armillotta, Nicol\`o Romandini, Rebecca Montanari.

**Figure 1.** Figure 1: Overview of the ANTAEUS five-stage pipeline: prioritization, context-grounding, structured reasoning, validation, and reporting. 1 func1(args) 2 { 3 callee1(args); 4 callee2(args); 5 ... 6 } 7 ... 8 funcN(args) 9 { 10 callee1(args); 11 ... 12 } Listing 2: Compressed repository representation used during prioritization. Function bodies are reduced to signatures and callees, omitting internal implementation … view at source ↗

read the original abstract

LLM-based vulnerability detectors have shown promising results in identifying memory-safety bugs and vulnerability classes whose violations can often be expressed through established security properties. Logic vulnerabilities, however, pose a different challenge, as their identification requires inferring application-specific security invariants and implicit assumptions about intended behavior. Even frontier agentic models struggle because these invariants are often implicit and buried among unrelated code. Motivated by this gap, we present Antaeus, a framework for detecting logic vulnerabilities that grounds LLM reasoning in repository-level code context. Antaeus follows a repository-scale pipeline combining function prioritization, context-grounded reasoning, comparative validation, and structured reporting. It ranks functions using lightweight repo-wide security signals, directing costly LLM analysis toward relevant code and reducing calls, cost, and triage effort. For each prioritized function, Antaeus combines local code context with a repository-level view of the application's functionality, security resources, and trust boundaries. This enables reasoning about how the function is executed within the broader application rather than as an isolated snippet. Antaeus identifies security-sensitive sinks, derives safety conditions for safe execution, and checks whether they are locally satisfied. Candidate findings undergo comparative validation, pruning concerns that reflect project-wide norms rather than distinctive violations. Finally, Antaeus reports sinks, violated safety conditions, and evidence, making findings actionable and traceable. We evaluate Antaeus on 28 repositories with confirmed logic vulnerabilities and compare it against function-level and agentic models. Antaeus detects and explains 15 vulnerabilities, outperforming baselines with comparable token usage and cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Antaeus gives a four-stage LLM pipeline for logic vulns that prioritizes and validates against repo norms, but the results only show rediscovery on pre-selected buggy repos.

read the letter

The main takeaway is that Antaeus builds a repo-scale pipeline to help LLMs surface logic vulnerabilities by feeding them broader context, but the evaluation only checks whether it can recover already-known issues in chosen repositories.

What is new is the explicit four-stage structure: lightweight prioritization of functions via repo-wide signals, context-grounded reasoning that pulls in application functionality and trust boundaries, comparative validation that drops findings matching project norms, and structured reporting of sinks and violated conditions. The prioritization step is a practical move to limit expensive LLM calls, and the comparative step directly addresses the risk of flagging ordinary code patterns.

The paper reports that this setup detected and explained 15 vulnerabilities across the 28 repositories while using token counts similar to the baselines. That efficiency angle and the focus on actionable output are the parts that stand out as useful engineering.

The soft spot is the evaluation setup. The 28 repositories were picked because they contain confirmed logic vulnerabilities, so the test measures rediscovery rather than blind detection in ordinary code. The abstract supplies no precision numbers, no false-positive rate on clean functions, no control set of repositories without known defects, and no count of total vulnerabilities present. Without those, the claim that repository context lets the model reliably infer implicit invariants stays under-supported.

This is for researchers working on LLM-assisted security tools who need concrete pipelines rather than abstract claims. The architecture is coherent and the cost-reduction ideas are worth discussing, so it deserves a serious referee even though the experimental section will need expansion on prospective testing.

Referee Report

2 major / 0 minor

Summary. The paper presents Antaeus, a framework that combines function prioritization using repo-wide signals, context-grounded LLM reasoning over repository-level views of functionality and trust boundaries, comparative validation against project norms, and structured reporting to detect logic vulnerabilities. It evaluates the approach on 28 repositories containing confirmed logic vulnerabilities, reporting that Antaeus detects and explains 15 vulnerabilities while outperforming function-level and agentic baselines at comparable token usage and cost.

Significance. If the evaluation were strengthened with appropriate controls, the work could advance automated detection of application-specific logic vulnerabilities, a class that is difficult for both static analysis and isolated LLM prompting because it requires inferring implicit invariants. The pipeline design (prioritization to reduce LLM calls, repository-scale context, and comparative pruning) directly targets the limitations noted in the abstract for frontier models.

major comments (2)

[Abstract] Abstract: The headline result states that Antaeus detects and explains 15 vulnerabilities across the 28 repositories, yet the abstract supplies no total count of vulnerabilities present in those repositories, no precision or recall figures, and no false-positive analysis on clean code or on repositories without known defects. These omissions make it impossible to evaluate whether the context-grounded pipeline reliably infers implicit invariants or merely surfaces already-known issues.
[Abstract] Abstract: The 28 repositories were chosen precisely because they contain confirmed logic vulnerabilities. This design measures rediscovery performance rather than prospective detection in repositories without prior knowledge of defects, leaving the central claim—that repository-level context enables reliable inference of buried, application-specific security invariants—untested in the regime the method is intended to address.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation design and abstract. The comments highlight valid points about the scope of our results and the need for clearer reporting. We address each comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The headline result states that Antaeus detects and explains 15 vulnerabilities across the 28 repositories, yet the abstract supplies no total count of vulnerabilities present in those repositories, no precision or recall figures, and no false-positive analysis on clean code or on repositories without known defects. These omissions make it impossible to evaluate whether the context-grounded pipeline reliably infers implicit invariants or merely surfaces already-known issues.

Authors: We agree that the abstract would benefit from additional quantitative context. The full evaluation in Section 5 reports that Antaeus detected and explained 15 vulnerabilities across the 28 repositories, with all findings manually validated as true positives (yielding 100% precision on reported results). The total number of confirmed vulnerabilities in the repositories is provided in the evaluation setup, and recall is not reported because ground truth is limited to the known cases used for selection. We did not conduct separate experiments on clean repositories without known defects. We will revise the abstract to explicitly state these metrics and the evaluation scope, and add a limitations paragraph clarifying that the results demonstrate the pipeline's effectiveness at surfacing and explaining vulnerabilities via context rather than presupposing defect knowledge. revision: yes
Referee: [Abstract] Abstract: The 28 repositories were chosen precisely because they contain confirmed logic vulnerabilities. This design measures rediscovery performance rather than prospective detection in repositories without prior knowledge of defects, leaving the central claim—that repository-level context enables reliable inference of buried, application-specific security invariants—untested in the regime the method is intended to address.

Authors: We acknowledge that selecting repositories with confirmed vulnerabilities enables ground-truth comparison and measures rediscovery performance. This is a deliberate design choice common in vulnerability detection studies to rigorously assess whether the method can infer the relevant invariants from code context alone. The Antaeus pipeline has no access to vulnerability labels and operates only on repository signals, code, and LLM reasoning; the superior performance over function-level and agentic baselines supports the value of repository-scale context. We agree that prospective detection on repositories without prior defect knowledge would provide complementary evidence. We will revise the abstract and add a dedicated discussion subsection to explicitly frame the evaluation as a controlled study with ground truth and to note the challenges and limitations of prospective evaluation. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an empirical LLM-based detection framework and reports results from running it on 28 pre-selected repositories containing known logic vulnerabilities. No equations, fitted parameters, predictions derived from subsets of data, or self-citation chains appear in the provided text. The detection count (15 vulnerabilities) is presented as a direct empirical outcome of the pipeline rather than a quantity defined by construction from the authors' inputs or prior work. The evaluation setup measures rediscovery on known cases but does not reduce any claimed result to a self-referential definition or fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or additional axioms are stated beyond the implicit reliance on LLM reasoning capability.

axioms (1)

domain assumption Frontier LLMs can infer implicit application-specific security invariants when given repository-scale context
Central to the context-grounded reasoning stage described in the abstract.

pith-pipeline@v0.9.1-grok · 5815 in / 1181 out tokens · 37835 ms · 2026-07-02T10:20:40.546704+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 9 canonical work pages · 2 internal anchors

[1]

Toward automated detection of logic vulnerabilities in web applications,

V . Felmetsger, L. Cavedon, C. Kruegel, and G. Vigna, “Toward automated detection of logic vulnerabilities in web applications,” in 19th USENIX Security Symposium (USENIX Security 10), 2010

2010
[2]

CWE-200: Exposure of Sensitive Information to an Unau- thorized Actor,

MITRE, “CWE-200: Exposure of Sensitive Information to an Unau- thorized Actor,” https://cwe.mitre.org/data/definitions/200.html, 2026, Common Weakness Enumeration

2026
[3]

CWE-284: Improper Access Control,

——, “CWE-284: Improper Access Control,” https://cwe.mitre.org/ data/definitions/284.html, 2026, Common Weakness Enumeration

2026
[4]

2025 CWE Top 25 Most Dangerous Software Weaknesses,

——, “2025 CWE Top 25 Most Dangerous Software Weaknesses,” https://cwe.mitre.org/top25/archive/2025/2025 cwe top25.html, 2025, Common Weakness Enumeration. Page last updated: December 15, 2025. Accessed: 2026-06-09

2025
[5]

CodeQL Documentation,

GitHub, “CodeQL Documentation,” https://codeql.github.com/docs/, accessed: 2026-05-18

2026
[6]

Semgrep Documentation,

Semgrep, “Semgrep Documentation,” https://semgrep.dev/docs/, accessed: 2026-05-18

2026
[7]

Logicscope: Automatic discovery of logic vul- nerabilities within web applications,

X. Li and Y . Xue, “Logicscope: Automatic discovery of logic vul- nerabilities within web applications,” inProceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security, 2013, pp. 481–486

2013
[8]

Detecting logic vulnerabilities in e- commerce applications

F. Sun, L. Xu, and Z. Su, “Detecting logic vulnerabilities in e- commerce applications.” inNDSS, 2014

2014
[9]

Toward black-box detection of logic flaws in web applications

G. Pellegrino and D. Balzarotti, “Toward black-box detection of logic flaws in web applications.” inNDSS, vol. 14, 2014, pp. 23–26

2014
[10]

Detlogic: A black-box approach for detecting logic vulnerabilities in web applications,

G. Deepa, P. S. Thilagam, A. Praseed, and A. R. Pais, “Detlogic: A black-box approach for detecting logic vulnerabilities in web applications,”Journal of Network and Computer Applications, vol. 109, pp. 89–109, 2018

2018
[11]

RepoAudit: An autonomous LLM-agent for repository-level code auditing,

J. Guo, C. Wang, X. Xu, Z. Su, and X. Zhang, “Repoaudit: An autonomous llm-agent for repository-level code auditing,”arXiv preprint arXiv:2501.18160, 2025

work page arXiv 2025
[12]

Reasoning runtime behavior of a program with llm: How far are we?

J. Chen, Z. Pan, X. Hu, Z. Li, G. Li, and X. Xia, “Reasoning runtime behavior of a program with llm: How far are we?” in2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 2025, pp. 1869–1881

2025
[13]

Project Glasswing: Securing Critical Software for the AI Era,

Anthropic, “Project Glasswing: Securing Critical Software for the AI Era,” https://www.anthropic.com/project/glasswing, 2026, accessed: 2026-04

2026
[14]

Carlini, N

N. Carlini, N. Cheng, K. Lucas, M. Moore, M. Nasr, V . Prabhushankar, W. Xiao, H. Angulu, E. B. Asher, J. Bow, K. Bradwell, B. Buchanan, D. Forsythe, D. Freeman, A. Gaynor, X. Ge, L. Graham, K. Guru, H. Lakhani, M. McNiece, M. Mehrara, R. Nichol, A. Pirzada, S. Porter, A. Terzis, and K. Troy. Assessing claude mythos preview’s cybersecurity capabilities. [...

2026
[15]

Everything you wanted to know about llm-based vulnerability detec- tion but were afraid to ask,

Y . Li, X. Li, H. Wu, M. Xu, Y . Zhang, X. Cheng, F. Xu, and S. Zhong, “Everything you wanted to know about llm-based vulnerability detec- tion but were afraid to ask,”arXiv preprint arXiv:2504.13474, 2025

work page arXiv 2025
[16]

Llms cannot reliably identify and reason about security vulnera- bilities (yet?): A comprehensive evaluation, framework, and bench- marks,

S. Ullah, M. Han, S. Pujar, H. Pearce, A. Coskun, and G. Stringhini, “Llms cannot reliably identify and reason about security vulnera- bilities (yet?): A comprehensive evaluation, framework, and bench- marks,” in2024 IEEE symposium on security and privacy (SP). IEEE, 2024, pp. 862–880

2024
[17]

Benchmarking llms and llm-based agents in practical vulnerability detection for code repositories,

A. Yildiz, S. G. Teo, Y . Lou, Y . Feng, C. Wang, and D. M. Divakaran, “Benchmarking llms and llm-based agents in practical vulnerability detection for code repositories,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 30 848–30 865

2025
[18]

Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

X. Zheng, A. Pesoli, M. Valleri, S. Jana, and L. Cavallaro, “Veritas: A semantically grounded agentic framework for memory corruption vulnerability detection in binaries,”arXiv preprint arXiv:2605.15097, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Demystifying the Mythos or Disrupting Bugonomics? From Zero-Day Asymmetry to Defender Remediation Throughput

A. Pesoli, H. Errico, and L. Cavallaro, “Demystifying the mythos or disrupting bugonomics? from zero-day asymmetry to defender remediation throughput,”arXiv preprint arXiv:2605.24632, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

Introducing claude opus 4.8,

Anthropic, “Introducing claude opus 4.8,” https://www.anthropic.co m/news/claude-opus-4-8, 2026

2026
[21]

Introducing gpt-5.4,

OpenAI, “Introducing gpt-5.4,” https://openai.com/index/introducing -gpt-5-4/, 2026

2026
[22]

Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses ,

Y . Ding, Y . Fu, O. Ibrahim, C. Sitawarin, X. Chen, B. Alomair, D. Wagner, B. Ray, and Y . Chen, “Vulnerability detection with code language models: How far are we?” inProceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25. IEEE Press, 2025, p. 1729–1741. [Online]. Available: https://doi.org/10.1109/ICSE55347.2025.00038

work page doi:10.1109/icse55347.2025.00038 2025
[23]

Language models don’t always say what they think: Unfaithful explanations in chain- of-thought prompting,

M. Turpin, J. Michael, E. Perez, and S. Bowman, “Language models don’t always say what they think: Unfaithful explanations in chain- of-thought prompting,”Advances in Neural Information Processing Systems, vol. 36, pp. 74 952–74 965, 2023

2023
[24]

Bugs as deviant behavior: a general approach to inferring errors in systems code,

D. Engler, D. Y . Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as deviant behavior: a general approach to inferring errors in systems code,” inProceedings of the Eighteenth ACM Symposium on Operating Systems Principles, ser. SOSP ’01. New York, NY , USA: Association for Computing Machinery, 2001, p. 57–72. [Online]. Available: https://doi.org/10.1145/502...

work page doi:10.1145/502034.502041 2001
[25]

Chucky: exposing missing checks in source code for vulnerability discovery,

F. Yamaguchi, C. Wressnegger, H. Gascon, and K. Rieck, “Chucky: exposing missing checks in source code for vulnerability discovery,” inProceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, ser. CCS ’13. New York, NY , USA: Association for Computing Machinery, 2013, p. 499–510. [Online]. Available: https://doi.org/10.1145/250...

work page doi:10.1145/2508859.2516665 2013
[26]

Sok: Explainable machine learning for computer security applications,

A. Nadeem, D. V os, C. Cao, L. Pajola, S. Dieck, R. Baumgartner, and S. Verwer, “Sok: Explainable machine learning for computer security applications,” in2023 IEEE 8th European symposium on security and privacy (EuroS&P). IEEE, 2023, pp. 221–240

2023
[27]

Unixcoder: Unified cross-modal pre-training for code representation,

D. Guo, S. Lu, N. Duan, Y . Wang, M. Zhou, and J. Yin, “Unixcoder: Unified cross-modal pre-training for code representation,” inProceed- ings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 7212–7225

2022
[28]

Reposvul: A repository-level high-quality vulnerability dataset,

X. Wang, R. Hu, C. Gao, X.-C. Wen, Y . Chen, and Q. Liao, “Reposvul: A repository-level high-quality vulnerability dataset,” in Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 2024, pp. 472–483

2024
[29]

Vul-rag: Enhancing llm-based vulnerability detection via knowledge-level rag,

X. Du, G. Zheng, K. Wang, Y . Zou, Y . Wang, W. Deng, J. Feng, M. Liu, B. Chen, X. Peng, T. Ma, and Y . Lou, “Vul-rag: Enhancing llm-based vulnerability detection via knowledge-level rag,”ACM Trans. Softw. Eng. Methodol., Feb. 2026, just Accepted. [Online]. Available: https://doi.org/10.1145/3797277

work page doi:10.1145/3797277 2026
[30]

Llm-cloudsec: Large language model em- powered automatic and deep vulnerability analysis for intelligent clouds,

D. Cao and W. Jun, “Llm-cloudsec: Large language model em- powered automatic and deep vulnerability analysis for intelligent clouds,” inIEEE INFOCOM 2024 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2024, pp. 1–6

2024
[31]

Iris: Llm-assisted static analysis for detecting security vulnerabilities,

Z. Li, S. Dutta, and M. Naik, “Iris: Llm-assisted static analysis for detecting security vulnerabilities,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 35 735–35 758

2025
[32]

Enhancing static analysis for practical bug detection: An llm-integrated approach,

H. Li, Y . Hao, Y . Zhai, and Z. Qian, “Enhancing static analysis for practical bug detection: An llm-integrated approach,”Proc. ACM Program. Lang., vol. 8, no. OOPSLA1, Apr. 2024. [Online]. Available: https://doi.org/10.1145/3649828

work page doi:10.1145/3649828 2024
[33]

Interleaving static analysis and llm prompting,

P. J. Chapman, C. Rubio-Gonz ´alez, and A. V . Thakur, “Interleaving static analysis and llm prompting,” inProceedings of the 13th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, 2024, pp. 9–17

2024
[34]

{LLMxCPG}:{Context-Aware}vulnerability detection through code property{Graph-Guided}large language models,

A. Lekssays, H. Mouhcine, K. Tran, T. Yu, and I. Khalil, “{LLMxCPG}:{Context-Aware}vulnerability detection through code property{Graph-Guided}large language models,” in34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 489– 507

2025
[35]

From func- tion to repository: Towards repository-level evaluation of software vulnerability detection,

X.-C. Wen, X. Wang, Y . Chen, R. Hu, D. Lo, and C. Gao, “From func- tion to repository: Towards repository-level evaluation of software vulnerability detection,”IEEE Transactions on Software Engineering, 2026. Ethical Considerations This work focuses on the automated detection of logi- cal vulnerabilities in software repositories. Our benchmark consists e...

2026

[1] [1]

Toward automated detection of logic vulnerabilities in web applications,

V . Felmetsger, L. Cavedon, C. Kruegel, and G. Vigna, “Toward automated detection of logic vulnerabilities in web applications,” in 19th USENIX Security Symposium (USENIX Security 10), 2010

2010

[2] [2]

CWE-200: Exposure of Sensitive Information to an Unau- thorized Actor,

MITRE, “CWE-200: Exposure of Sensitive Information to an Unau- thorized Actor,” https://cwe.mitre.org/data/definitions/200.html, 2026, Common Weakness Enumeration

2026

[3] [3]

CWE-284: Improper Access Control,

——, “CWE-284: Improper Access Control,” https://cwe.mitre.org/ data/definitions/284.html, 2026, Common Weakness Enumeration

2026

[4] [4]

2025 CWE Top 25 Most Dangerous Software Weaknesses,

——, “2025 CWE Top 25 Most Dangerous Software Weaknesses,” https://cwe.mitre.org/top25/archive/2025/2025 cwe top25.html, 2025, Common Weakness Enumeration. Page last updated: December 15, 2025. Accessed: 2026-06-09

2025

[5] [5]

CodeQL Documentation,

GitHub, “CodeQL Documentation,” https://codeql.github.com/docs/, accessed: 2026-05-18

2026

[6] [6]

Semgrep Documentation,

Semgrep, “Semgrep Documentation,” https://semgrep.dev/docs/, accessed: 2026-05-18

2026

[7] [7]

Logicscope: Automatic discovery of logic vul- nerabilities within web applications,

X. Li and Y . Xue, “Logicscope: Automatic discovery of logic vul- nerabilities within web applications,” inProceedings of the 8th ACM SIGSAC symposium on Information, computer and communications security, 2013, pp. 481–486

2013

[8] [8]

Detecting logic vulnerabilities in e- commerce applications

F. Sun, L. Xu, and Z. Su, “Detecting logic vulnerabilities in e- commerce applications.” inNDSS, 2014

2014

[9] [9]

Toward black-box detection of logic flaws in web applications

G. Pellegrino and D. Balzarotti, “Toward black-box detection of logic flaws in web applications.” inNDSS, vol. 14, 2014, pp. 23–26

2014

[10] [10]

Detlogic: A black-box approach for detecting logic vulnerabilities in web applications,

G. Deepa, P. S. Thilagam, A. Praseed, and A. R. Pais, “Detlogic: A black-box approach for detecting logic vulnerabilities in web applications,”Journal of Network and Computer Applications, vol. 109, pp. 89–109, 2018

2018

[11] [11]

RepoAudit: An autonomous LLM-agent for repository-level code auditing,

J. Guo, C. Wang, X. Xu, Z. Su, and X. Zhang, “Repoaudit: An autonomous llm-agent for repository-level code auditing,”arXiv preprint arXiv:2501.18160, 2025

work page arXiv 2025

[12] [12]

Reasoning runtime behavior of a program with llm: How far are we?

J. Chen, Z. Pan, X. Hu, Z. Li, G. Li, and X. Xia, “Reasoning runtime behavior of a program with llm: How far are we?” in2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE, 2025, pp. 1869–1881

2025

[13] [13]

Project Glasswing: Securing Critical Software for the AI Era,

Anthropic, “Project Glasswing: Securing Critical Software for the AI Era,” https://www.anthropic.com/project/glasswing, 2026, accessed: 2026-04

2026

[14] [14]

Carlini, N

N. Carlini, N. Cheng, K. Lucas, M. Moore, M. Nasr, V . Prabhushankar, W. Xiao, H. Angulu, E. B. Asher, J. Bow, K. Bradwell, B. Buchanan, D. Forsythe, D. Freeman, A. Gaynor, X. Ge, L. Graham, K. Guru, H. Lakhani, M. McNiece, M. Mehrara, R. Nichol, A. Pirzada, S. Porter, A. Terzis, and K. Troy. Assessing claude mythos preview’s cybersecurity capabilities. [...

2026

[15] [15]

Everything you wanted to know about llm-based vulnerability detec- tion but were afraid to ask,

Y . Li, X. Li, H. Wu, M. Xu, Y . Zhang, X. Cheng, F. Xu, and S. Zhong, “Everything you wanted to know about llm-based vulnerability detec- tion but were afraid to ask,”arXiv preprint arXiv:2504.13474, 2025

work page arXiv 2025

[16] [16]

Llms cannot reliably identify and reason about security vulnera- bilities (yet?): A comprehensive evaluation, framework, and bench- marks,

S. Ullah, M. Han, S. Pujar, H. Pearce, A. Coskun, and G. Stringhini, “Llms cannot reliably identify and reason about security vulnera- bilities (yet?): A comprehensive evaluation, framework, and bench- marks,” in2024 IEEE symposium on security and privacy (SP). IEEE, 2024, pp. 862–880

2024

[17] [17]

Benchmarking llms and llm-based agents in practical vulnerability detection for code repositories,

A. Yildiz, S. G. Teo, Y . Lou, Y . Feng, C. Wang, and D. M. Divakaran, “Benchmarking llms and llm-based agents in practical vulnerability detection for code repositories,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 30 848–30 865

2025

[18] [18]

Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

X. Zheng, A. Pesoli, M. Valleri, S. Jana, and L. Cavallaro, “Veritas: A semantically grounded agentic framework for memory corruption vulnerability detection in binaries,”arXiv preprint arXiv:2605.15097, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Demystifying the Mythos or Disrupting Bugonomics? From Zero-Day Asymmetry to Defender Remediation Throughput

A. Pesoli, H. Errico, and L. Cavallaro, “Demystifying the mythos or disrupting bugonomics? from zero-day asymmetry to defender remediation throughput,”arXiv preprint arXiv:2605.24632, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

Introducing claude opus 4.8,

Anthropic, “Introducing claude opus 4.8,” https://www.anthropic.co m/news/claude-opus-4-8, 2026

2026

[21] [21]

Introducing gpt-5.4,

OpenAI, “Introducing gpt-5.4,” https://openai.com/index/introducing -gpt-5-4/, 2026

2026

[22] [22]

Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses ,

Y . Ding, Y . Fu, O. Ibrahim, C. Sitawarin, X. Chen, B. Alomair, D. Wagner, B. Ray, and Y . Chen, “Vulnerability detection with code language models: How far are we?” inProceedings of the IEEE/ACM 47th International Conference on Software Engineering, ser. ICSE ’25. IEEE Press, 2025, p. 1729–1741. [Online]. Available: https://doi.org/10.1109/ICSE55347.2025.00038

work page doi:10.1109/icse55347.2025.00038 2025

[23] [23]

Language models don’t always say what they think: Unfaithful explanations in chain- of-thought prompting,

M. Turpin, J. Michael, E. Perez, and S. Bowman, “Language models don’t always say what they think: Unfaithful explanations in chain- of-thought prompting,”Advances in Neural Information Processing Systems, vol. 36, pp. 74 952–74 965, 2023

2023

[24] [24]

Bugs as deviant behavior: a general approach to inferring errors in systems code,

D. Engler, D. Y . Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as deviant behavior: a general approach to inferring errors in systems code,” inProceedings of the Eighteenth ACM Symposium on Operating Systems Principles, ser. SOSP ’01. New York, NY , USA: Association for Computing Machinery, 2001, p. 57–72. [Online]. Available: https://doi.org/10.1145/502...

work page doi:10.1145/502034.502041 2001

[25] [25]

Chucky: exposing missing checks in source code for vulnerability discovery,

F. Yamaguchi, C. Wressnegger, H. Gascon, and K. Rieck, “Chucky: exposing missing checks in source code for vulnerability discovery,” inProceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, ser. CCS ’13. New York, NY , USA: Association for Computing Machinery, 2013, p. 499–510. [Online]. Available: https://doi.org/10.1145/250...

work page doi:10.1145/2508859.2516665 2013

[26] [26]

Sok: Explainable machine learning for computer security applications,

A. Nadeem, D. V os, C. Cao, L. Pajola, S. Dieck, R. Baumgartner, and S. Verwer, “Sok: Explainable machine learning for computer security applications,” in2023 IEEE 8th European symposium on security and privacy (EuroS&P). IEEE, 2023, pp. 221–240

2023

[27] [27]

Unixcoder: Unified cross-modal pre-training for code representation,

D. Guo, S. Lu, N. Duan, Y . Wang, M. Zhou, and J. Yin, “Unixcoder: Unified cross-modal pre-training for code representation,” inProceed- ings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 7212–7225

2022

[28] [28]

Reposvul: A repository-level high-quality vulnerability dataset,

X. Wang, R. Hu, C. Gao, X.-C. Wen, Y . Chen, and Q. Liao, “Reposvul: A repository-level high-quality vulnerability dataset,” in Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, 2024, pp. 472–483

2024

[29] [29]

Vul-rag: Enhancing llm-based vulnerability detection via knowledge-level rag,

X. Du, G. Zheng, K. Wang, Y . Zou, Y . Wang, W. Deng, J. Feng, M. Liu, B. Chen, X. Peng, T. Ma, and Y . Lou, “Vul-rag: Enhancing llm-based vulnerability detection via knowledge-level rag,”ACM Trans. Softw. Eng. Methodol., Feb. 2026, just Accepted. [Online]. Available: https://doi.org/10.1145/3797277

work page doi:10.1145/3797277 2026

[30] [30]

Llm-cloudsec: Large language model em- powered automatic and deep vulnerability analysis for intelligent clouds,

D. Cao and W. Jun, “Llm-cloudsec: Large language model em- powered automatic and deep vulnerability analysis for intelligent clouds,” inIEEE INFOCOM 2024 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2024, pp. 1–6

2024

[31] [31]

Iris: Llm-assisted static analysis for detecting security vulnerabilities,

Z. Li, S. Dutta, and M. Naik, “Iris: Llm-assisted static analysis for detecting security vulnerabilities,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 35 735–35 758

2025

[32] [32]

Enhancing static analysis for practical bug detection: An llm-integrated approach,

H. Li, Y . Hao, Y . Zhai, and Z. Qian, “Enhancing static analysis for practical bug detection: An llm-integrated approach,”Proc. ACM Program. Lang., vol. 8, no. OOPSLA1, Apr. 2024. [Online]. Available: https://doi.org/10.1145/3649828

work page doi:10.1145/3649828 2024

[33] [33]

Interleaving static analysis and llm prompting,

P. J. Chapman, C. Rubio-Gonz ´alez, and A. V . Thakur, “Interleaving static analysis and llm prompting,” inProceedings of the 13th ACM SIGPLAN International Workshop on the State Of the Art in Program Analysis, 2024, pp. 9–17

2024

[34] [34]

{LLMxCPG}:{Context-Aware}vulnerability detection through code property{Graph-Guided}large language models,

A. Lekssays, H. Mouhcine, K. Tran, T. Yu, and I. Khalil, “{LLMxCPG}:{Context-Aware}vulnerability detection through code property{Graph-Guided}large language models,” in34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 489– 507

2025

[35] [35]

From func- tion to repository: Towards repository-level evaluation of software vulnerability detection,

X.-C. Wen, X. Wang, Y . Chen, R. Hu, D. Lo, and C. Gao, “From func- tion to repository: Towards repository-level evaluation of software vulnerability detection,”IEEE Transactions on Software Engineering, 2026. Ethical Considerations This work focuses on the automated detection of logi- cal vulnerabilities in software repositories. Our benchmark consists e...

2026