An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors

Amjed Tahir; Chong Wang; Jiaxin Yu; Mojtaba Shahin; Peng Liang; Yangxiao Cai; Yujia Fu

arxiv: 2401.16310 · v6 · submitted 2024-01-29 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-24 04:44 UTC · model grok-4.3

keywords: security code review · large language models · static analysis tools · security defect detection · prompt engineering · code complexity · empirical evaluation

The pith

Large language models outperform static analysis tools at detecting security defects in code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can improve security code review by catching defects more reliably than current automated tools. It evaluates seven LLMs across five prompt styles on code samples containing known security issues and directly compares the results to leading static analysis tools. Linguistic and regression analyses then examine response quality problems and the code features that affect success rates. A reader would care because static tools often generate many false positives and miss nuanced defects, leaving manual review slow and error-prone. If the results hold, LLMs could shift security review from tool-heavy to model-assisted workflows.

Core claim

The study finds that LLMs significantly outperform state-of-the-art static analysis tools in security code review, with the reasoning-optimized model DeepSeek-R1 achieving the highest performance. DeepSeek-R1 works best when prompts include both the commit message and chain-of-thought guidance, while GPT-4 via ChatGPT performs best with a Common Weakness Enumeration list in the prompt. GPT-4 often produces vague expressions and struggles to follow instructions exactly, whereas DeepSeek-R1 more frequently generates inaccurate code details. LLMs detect defects more readily in files with fewer tokens and fewer security-relevant annotations, and higher code complexity correlates with better DeepSeek-R1 detection for specific security defect types.

What carries the argument

Comparative evaluation of seven LLMs under five prompt variants against static analysis baselines on security defect detection, followed by linguistic analysis of outputs and regression on code features.

If this is right

LLMs achieve higher detection performance than static analysis tools across the evaluated scenarios.
Prompt design must be tailored to the specific LLM, as optimal strategies differ between reasoning-optimized and general models.
Detection success rises for code files containing fewer tokens and fewer security annotations.
For DeepSeek-R1, higher code complexity correlates with better detection on certain defect types.
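
To make the prompt-tailoring point concrete, here is a minimal sketch of how prompt variants with optional context blocks might be assembled. The two instruction sentences and the `####commit message` marker are quoted from the paper's prompt-template figure; the function name `build_prompt`, the `####code` marker, the wording that introduces the CWE list, and the sample CWE entries are assumptions made for illustration and are not the paper's templates.

```python
from typing import Optional, Sequence

# Instruction sentences quoted from the prompt-template figure. Everything else
# here (function name, "####code" marker, CWE intro wording) is hypothetical.
BASE_INSTRUCTION = (
    "Please review the code below to detect security defects. "
    "If any are found, please describe the security defect in detail and "
    "indicate the corresponding line number of code and solution."
)
COT_GUIDANCE = "You should think step-by-step and then provide the final answer."


def build_prompt(
    code: str,
    commit_message: Optional[str] = None,
    cwe_list: Optional[Sequence[str]] = None,
    chain_of_thought: bool = False,
) -> str:
    """Assemble one prompt variant from optional context blocks."""
    parts = [BASE_INSTRUCTION]
    if cwe_list:
        # Assumed phrasing; the paper's CWE prompt wording is not shown here.
        bullet_lines = "\n".join(f"- {entry}" for entry in cwe_list)
        parts.append("Consider these weakness categories:\n" + bullet_lines)
    if commit_message:
        parts.append("####commit message\n" + commit_message.strip())
    parts.append("####code\n" + code.rstrip())
    if chain_of_thought:
        parts.append(COT_GUIDANCE)
    return "\n\n".join(parts)


# Illustrative inputs only (not taken from the paper's dataset).
snippet = "import os\nos.system('ping ' + host)\n"
cot_prompt = build_prompt(snippet, commit_message="Add ping helper", chain_of_thought=True)
cwe_prompt = build_prompt(
    snippet,
    cwe_list=["CWE-78: OS Command Injection", "CWE-79: Cross-site Scripting"],
)
```

Keeping each context block optional makes it straightforward to run the same code sample through every variant and compare which combination suits a given model.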

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could insert LLMs as an initial filter to reduce the volume of issues passed to human reviewers or static tools.
Response quality problems such as vagueness or inaccurate details suggest that raw LLM output would still require human verification or additional tooling.
The observed sensitivity to code length and complexity points to possible value in segmenting large files before LLM review.
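
The segmentation idea can be sketched as follows. This is an illustration under stated assumptions, not something the paper evaluates: `segment_source` and `approx_tokens` are hypothetical helpers, the token count is a crude whitespace estimate rather than a real tokenizer, and chunks break on line boundaries without regard to function or class structure.

```python
from typing import List


def approx_tokens(line: str) -> int:
    """Crude token estimate: whitespace-separated pieces, at least 1 per line.
    A model's real tokenizer would give different counts; this is a stand-in."""
    return max(1, len(line.split()))


def segment_source(source: str, max_tokens: int) -> List[str]:
    """Split source text into consecutive chunks of whole lines, each within
    max_tokens by the crude estimate. A single line that exceeds the budget
    is emitted as its own chunk rather than being cut."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for line in source.splitlines():
        cost = approx_tokens(line)
        # Close the running chunk before it would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks


example_source = "a b c\nd e\nf\ng h i j\n"
example_chunks = segment_source(example_source, max_tokens=4)
```

A practical pipeline would more likely split on function boundaries so that each chunk keeps its own context, since missing context is itself a reliability concern for LLM review.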

Load-bearing premise

The chosen code samples, defect types, and comparison metrics fairly represent real-world security code review tasks without prompt variations introducing bias that favors the LLMs.

What would settle it

Running the top-performing LLMs on an independent set of production code commits containing verified security defects and checking whether the outperformance over static tools still appears.
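
One way such a replication could be scored is a paired comparison over the same defective commits. The sketch below uses an exact McNemar test on the discordant pairs; that choice of test is an assumption of this illustration, not the statistical procedure reported in the paper, and the hit/miss lists are made-up placeholder data.

```python
from math import comb
from typing import Sequence, Tuple


def detection_rate(hits: Sequence[bool]) -> float:
    """Fraction of known-defective samples that were flagged."""
    return sum(hits) / len(hits)


def mcnemar_exact(
    llm_hits: Sequence[bool], tool_hits: Sequence[bool]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired detection outcomes.

    Returns (b, c, p) where b counts samples only the LLM caught and
    c counts samples only the static tool caught."""
    if len(llm_hits) != len(tool_hits):
        raise ValueError("both lists must cover the same samples")
    b = sum(1 for l, t in zip(llm_hits, tool_hits) if l and not t)
    c = sum(1 for l, t in zip(llm_hits, tool_hits) if t and not l)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence either way
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Placeholder outcomes for ten defective commits (hypothetical data).
llm = [True] * 8 + [False] * 2
tool = [True] + [False] * 7 + [True, False]
b, c, p_value = mcnemar_exact(llm, tool)
```

Because both systems are judged on identical samples, the paired test uses only the cases where they disagree, which is more informative than comparing two detection rates in isolation.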

Figures

Figures reproduced from arXiv: 2401.16310 by Amjed Tahir, Chong Wang, Jiaxin Yu, Mojtaba Shahin, Peng Liang, Yangxiao Cai, Yujia Fu.

**Figure 1.** Distribution of LoC of the code files with security defects.

**Figure 2.** An overview of the research procedure for investigating the three RQs.

**Figure 3.** Distribution of security defect types on the Python and C/C++ dataset.

**Figure 4.** Construction templates for the five prompts.

**Figure 5.**

**Figure 6.** The Phi_K correlation coefficients of explanatory variables.

**Figure 7.** Dotplot of the Spearman ρ² between explanatory and response variables across three repetitive experiments.

**Figure 8.** Response Entropy of each LLM-prompt combinations on the Python dataset.

**Figure 9.** Response Entropy of each LLM-prompt combinations on the C/C++ dataset.

**Figure 10.** Average proportion of each problem type present in responses generated by GPT-4.
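
For readers unfamiliar with the statistic in Figure 7, Spearman ρ² is the squared rank correlation between two variables. The following is a small standard-library sketch of that computation with average ranks for ties; the helper names and the sample numbers (token counts against ordinal ratings) are hypothetical and do not come from the paper's data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Square of the Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    rho = cov / (vx * vy) ** 0.5
    return rho * rho


# Hypothetical example: file token counts against an ordinal response rating.
token_counts = [120, 450, 900, 2000, 3100]
ratings = [4, 4, 3, 2, 1]
rho_sq = spearman_rho_squared(token_counts, ratings)
```

Squaring discards the sign, so ρ² indicates only how strongly a factor is monotonically associated with the response, not the direction of the association.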

Original abstract

Security code review is a time-consuming and labor-intensive process typically requiring integration with automated security defect detection tools. However, existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges. In this study, we conducted an empirical study to explore the potential of LLMs in detecting security defects during code review. Specifically, we evaluated the performance of seven LLMs under five different prompts and compared them with state-of-the-art static analysis tools. We also performed linguistic and regression analyses for the two top-performing LLMs to identify quality problems in their responses and factors influencing their performance. Our findings show that: (1) In security code review, LLMs significantly outperform state-of-the-art static analysis tools, and the reasoning-optimized LLM performs better than general-purpose LLMs. (2) DeepSeek-R1 achieves the highest performance, followed by GPT-4 provided in the ChatGPT platform. The optimal prompt for DeepSeek-R1 incorporates both the commit message and chain-of-thought (CoT) guidance, while for GPT-4 via ChatGPT, the prompt with a Common Weakness Enumeration (CWE) list works best. (3) GPT-4 via ChatGPT frequently produces vague expressions and exhibits difficulties in accurately following instructions in the prompts, while DeepSeek-R1 more commonly generates inaccurate code details in its outputs. (4) LLMs are more adept at identifying security defects in code files that have fewer tokens and security-relevant annotations. (5) Higher code complexity correlates with enhanced detection capabilities of DeepSeek-R1 for specific security defect types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs beat the static tools on this dataset with some prompt and factor analysis, but the comparison parity needs checking.

The main thing to know is that this empirical study finds LLMs, especially DeepSeek-R1, detect security defects better than the static analysis baselines they tested, with additional results on which prompts help and how code length and complexity affect outcomes. The ranking of models and the regression findings on token count and annotations are the concrete new pieces here. The work does a solid job running the same set of prompts across seven models, then doing linguistic checks on the responses for vagueness or wrong details, plus the regression step to surface the factors. That moves it past a simple accuracy table.

The soft spot sits in the baseline comparison. The abstract states LLMs significantly outperform the static tools, yet it is not explicit whether those tools received the same extra signals (commit messages, CWE lists) that boosted the LLMs or were run only on defaults. Since the paper already shows LLM results vary with those inputs, an unmatched setup could inflate the gap. Dataset size, language coverage, and exact statistical tests would also need a close look in the full text to judge how far the numbers generalize.

This paper is for people working on AI-assisted code review or security tooling in software engineering. A reader who wants prompt recommendations or factor data for LLM use in this domain gets direct value from the results. It deserves a serious referee because the empirical design is straightforward and the topic is relevant, even if the comparison details require tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper reports an empirical study evaluating seven LLMs (including reasoning-optimized models) across five prompt variants for security defect detection during code review. It compares LLM performance against state-of-the-art static analysis tools on the same tasks, then conducts linguistic and regression analyses on the top two LLMs (DeepSeek-R1 and GPT-4 via ChatGPT) to examine response quality issues and factors such as token count, annotations, and code complexity that influence detection rates. The central claims are that LLMs significantly outperform static tools, that prompt design and model type matter, and that specific factors correlate with better LLM performance on certain defect types.

Significance. If the empirical comparisons and factor analyses hold under equivalent conditions, the work supplies concrete evidence that LLMs can address limitations of static analyzers (generalization, false positives, granularity) in security code review and identifies actionable prompt and input characteristics for deployment. The multi-model, multi-prompt design plus regression analysis on influencing factors adds practical value beyond single-model case studies.

major comments (2)

[Findings (1) and (2); Methods section on static-tool baseline] The headline claim in finding (1) that LLMs significantly outperform SOTA static analysis tools is load-bearing, yet the manuscript provides no explicit description of the input parity or configuration given to the static tools. Finding (2) shows that LLM performance improves when prompts include commit messages or CWE lists; if static tools were run only on default settings without equivalent context, the measured gap cannot be attributed solely to model capability (see skeptic note on input parity).
[Dataset and experimental setup; Finding (4)] The evaluation assumes the chosen code samples and defect types fairly represent real-world security review tasks. No details are supplied on dataset size, selection criteria, language distribution, or whether samples were curated post-hoc; combined with finding (4) that LLMs perform better on shorter files with annotations, this raises the possibility that the dataset favors LLM strengths while under-representing the conditions where static tools are typically applied.

minor comments (2)

[Abstract; Finding (2)] The abstract and findings refer to 'DeepSeek-R1' without clarifying whether this is a specific model variant, a fine-tune, or a platform name; consistent nomenclature is needed in the methods section.
[Regression analysis subsection] The regression analysis in the linguistic/regression section would benefit from reporting exact coefficients, p-values, and model fit statistics rather than qualitative statements about correlations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying our experimental setup and dataset. We address the major comments point by point below.

Referee: [Findings (1) and (2); Methods section on static-tool baseline] The headline claim in finding (1) that LLMs significantly outperform SOTA static analysis tools is load-bearing, yet the manuscript provides no explicit description of the input parity or configuration given to the static tools. Finding (2) shows that LLM performance improves when prompts include commit messages or CWE lists; if static tools were run only on default settings without equivalent context, the measured gap cannot be attributed solely to model capability (see skeptic note on input parity).

Authors: We agree that the Methods section does not explicitly describe the configurations and inputs provided to the static analysis tools. Static analysis tools are designed to operate on source code alone and do not natively accept commit messages or CWE lists in the manner of LLM prompts. We ran each tool on the identical code samples used for the LLM evaluations, using their standard/default configurations as is conventional for such baselines. We will revise the manuscript to add an explicit subsection detailing the exact tool versions, settings, and rule sets applied, along with a note that contextual elements like commit messages are outside the design scope of static analyzers. This will strengthen the attribution of the observed performance differences.

Revision: yes

Referee: [Dataset and experimental setup; Finding (4)] The evaluation assumes the chosen code samples and defect types fairly represent real-world security review tasks. No details are supplied on dataset size, selection criteria, language distribution, or whether samples were curated post-hoc; combined with finding (4) that LLMs perform better on shorter files with annotations, this raises the possibility that the dataset favors LLM strengths while under-representing the conditions where static tools are typically applied.

Authors: We acknowledge that the manuscript lacks a dedicated description of the dataset construction. We will add this information in the revised experimental setup section, including total sample count, selection criteria (e.g., sourcing from public vulnerability repositories and code review benchmarks), language distribution, and any curation steps. While the dataset was assembled to cover common security defect types encountered in code review, we agree that the correlation in finding (4) warrants explicit discussion of potential limitations in representativeness; we will add a limitations paragraph addressing this point.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation against external baselines

The paper reports an empirical study that directly measures LLM detection rates on fixed code samples against SOTA static analysis tools, with performance numbers obtained from experiment rather than any derivation, equation, or fitted parameter. No self-definitional constructs, predictions that reduce to inputs by construction, or load-bearing self-citations appear in the reported claims or findings. The central result (LLM outperformance) is therefore independent of the paper's own inputs and stands as a standard empirical comparison.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical human-subject-style evaluation of LLMs on security tasks; no mathematical axioms, free parameters, or invented entities are invoked.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review
cs.SE 2026-03 accept novelty 7.0

LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

arXiv preprint arXiv:2308.14434

Using chatgpt as a static application security testing tool. arXiv preprint arXiv:2308.14434 . Blackwell, R.E., Barry, J., Cohn, A.G., 2024. Towards reproducible llm evaluation: Quantifying uncertainty in llm benchmark scores. arXiv preprint arXiv:2410.03492 . Borji, A., 2023. A categorical archive of chatgpt failures. arXiv preprint arXiv:2302.03494 . Bo...

work page arXiv 2024
[2]

Who does what during a code review? Datasets of OSS peer review repositories, in: Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), IEEE. pp. 49–52. Han, X., Tahir, A., Liang, P., Counsell, S., Blincoe, K., Li, B., Luo, Y., 2022. Code smells detection via modern code review: A study of the OpenStack and Qt communities. Empirical Software Engineering 27, 127. Harrell, ...

work page arXiv 2022
[3]

lostintheend

Everything you wanted to know about llm-based vulnerability detection but were afraid to ask. arXiv preprint arXiv:2504.13474 . Liu, F., Liu, Y., Shi, L., Huang, H., Wang, R., Yang, Z., Zhang, L., Li, Z., Ma, Y., 2024. Exploring and evaluating hallucinations in llm-powered code generation. arXiv preprint arXiv:2404.00971 . Liu, P., Yuan, W., Fu, J., Jiang...

work page doi:10.1145/3715758 2024
[4]

A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models

A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313 6. Tsai, C.N., Xie, J., Lai, C.M., Lin, C.S., 2025. Leveraging intra-and inter- references in vulnerability detection using multi-agent collaboration based on llms. Cluster Computing 28, 1–12. Turzo, A.K., 2022. Towards improving code re...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.15572150 2025
