pith. sign in

arxiv: 2603.18740 · v2 · submitted 2026-03-19 · 💻 cs.SE · cs.AI· cs.CR

Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review

Pith reviewed 2026-05-15 08:56 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.CR
keywords LLMautomated code reviewcontextual biasvulnerability detectionframing effectsecurity automationsupply chain attack
0
0 comments X

The pith

LLM-assisted code review can be fooled 100% of the time by iteratively refined pull request metadata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that presenting information in pull requests can systematically bias large language models when they review code for security issues. This framing effect appears consistently across multiple models, with neutral or bug-free descriptions producing the strongest misjudgments. The authors then show that a simple direct attack using templates often fails or alerts reviewers, but an iterative process where attackers refine their framing against a local copy of the review system succeeds in every tested case. This works because attackers can experiment repeatedly while real defenders see only the final submission. Removing metadata or adding clear instructions to the model eliminates the bias in all cases examined.

Core claim

LLM-based automated code review is vulnerable to contextual bias from PR metadata framing. A novel iterative refinement attack, which refines prompts against a cloned pipeline, achieves 100% success in causing the model to miss real vulnerabilities across 17 CVEs in 10 projects, while template attacks are ineffective. The success stems from an asymmetry where attackers iterate freely but defenders have only one detection opportunity. Metadata redaction and explicit instructions restore accurate detection in all affected cases.

What carries the argument

The framing effect, where the presentation of PR metadata overrides its semantic content in forming LLM security judgments, exploited through contextual-bias injection.

If this is right

  • Template-based attacks often backfire by raising reviewer suspicions.
  • The iterative attack succeeds every time due to the attacker iteration advantage.
  • Debiasing through metadata removal and explicit instructions eliminates the evasion in tested cases.
  • Over-reliance on ACR without human oversight creates security risks in the development process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attacker-defender asymmetry may apply to other LLM security tasks that allow prompt iteration.
  • Production pipelines should default to stripping untrusted metadata before LLM review.
  • Contributor reputation and human review become more important safeguards when automated checks can be bypassed.

Load-bearing premise

The controlled tests using 17 CVEs across 10 projects and six LLMs reflect how real production ACR pipelines respond to varied workflows and configurations.

What would settle it

Applying the iterative refinement attack to a live production ACR system and observing whether it evades detection at the same rate as in the controlled experiments.

Figures

Figures reproduced from arXiv: 2603.18740 by Dimitris Mitropoulos, Diomidis Spinellis, Georgios Alexopoulos, Nikolaos Alexopoulos.

Figure 1
Figure 1. Figure 1: Applying our end-to-end workflow to automatically generate a passing PR that reintroduces CVE-2024-56143 [ [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: ACR workflow for the strapi [70] project. Model Evaluation Framing Dataset Query Generation Vulnerable / patched pairs Detection Extraction Different bias levels Analysis Quantitative Qualitative Detection Corpus [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our controlled bias experiment. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Supply-chain attack threat model: adversary crafts [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Anatomy of a template-based adversarial pull request for [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Parameterized prompts used in our attack pipeline. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: LLM-assisted attack: In case of rejection the adver [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
read the original abstract

Automated Code Review (ACR) systems integrating Large Language Models (LLMs) are increasingly adopted in software development workflows, ranging from interactive assistants to autonomous agents in CI/CD pipelines. In this paper, we study how LLM-based vulnerability detection in ACR is affected by the framing effect: the tendency to let the presentation of information override its semantic content in forming judgments. We examine whether adversaries can exploit this through contextual-bias injection: crafting PR metadata to bias ACR security judgments as a supply-chain attack vector against real-world ACR pipelines. To this end, we first conduct a large-scale exploratory study across 6 LLMs under five framing conditions, establishing the framing effect as a systematic and widespread phenomenon in LLM-based vulnerability detection, with bug-free framing producing the strongest effect. We then design a realistic and controlled experimental environment, evaluating 17 CVEs across 10 real-world projects, to assess the susceptibility of real-world ACR pipelines to vulnerability reintroduction attacks. We employ two attack strategies: a template-based attack inspired by prior related work, and a novel LLM-assisted iterative refinement attack. We find that template-based attacks are ineffective and may even backfire, as direct biasing attempts raise suspicions. Our iterative refinement attack, on the other hand, achieves 100% success, exploiting a fundamental asymmetry: attackers can iteratively refine attacks against a local clone of the review pipeline, while defenders have only one chance to detect them. Debiasing via metadata redaction and explicit instructions restores detection in all affected cases. Overall, our findings highlight the dangers of over-relying on ACR and stress the importance of human oversight and contributor trust in the development process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper examines the framing effect in LLM-based automated code review (ACR) for security vulnerabilities. Through a large-scale exploratory study across 6 LLMs under five framing conditions, it shows that contextual framing—particularly bug-free framing—systematically biases vulnerability detection. It then evaluates susceptibility in a controlled setup with 17 CVEs across 10 real-world projects using two attack strategies: template-based attacks (found ineffective or counterproductive) and a novel LLM-assisted iterative refinement attack (achieving 100% success). The work highlights an asymmetry where attackers can refine against a local clone of the pipeline while defenders have limited detection opportunities, and demonstrates that debiasing via metadata redaction and explicit instructions restores detection.

Significance. If the results hold, the findings are significant for highlighting a practical supply-chain attack vector against LLM-integrated ACR systems. The large-scale empirical measurements across multiple models and real CVEs provide concrete evidence of the framing effect and its exploitability, underscoring risks of over-reliance on automated security reviews and the value of human oversight. The controlled evaluation on actual projects and the identification of effective debiasing methods offer actionable implications for secure development workflows.

major comments (1)
  1. Iterative refinement attack and asymmetry claim: The 100% success rate and the central claim of a 'fundamental asymmetry' (attackers iteratively refine against a local clone while defenders have one chance) is load-bearing for the supply-chain attack argument. However, all experiments use a fully controlled environment with known LLMs and fixed prompts across the 17 CVEs. No tests assess transfer when the clone differs from the target (e.g., model version, temperature, system prompt, or additional filters common in production CI/CD), which directly challenges whether the asymmetry holds outside the lab setup.
minor comments (2)
  1. Abstract: Limited detail is provided on exact success metrics, statistical tests performed, or potential confounds in the iterative attack refinement process.
  2. Experimental description: Clarify the number of independent runs, variance, or significance testing supporting the 100% success rate to strengthen replicability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and insightful comments on our work. We address the major comment below, acknowledging the controlled nature of our experiments while defending the core claims based on the evidence presented. We have made partial revisions to strengthen the discussion of limitations and generalizability.

read point-by-point responses
  1. Referee: Iterative refinement attack and asymmetry claim: The 100% success rate and the central claim of a 'fundamental asymmetry' (attackers iteratively refine against a local clone while defenders have one chance) is load-bearing for the supply-chain attack argument. However, all experiments use a fully controlled environment with known LLMs and fixed prompts across the 17 CVEs. No tests assess transfer when the clone differs from the target (e.g., model version, temperature, system prompt, or additional filters common in production CI/CD), which directly challenges whether the asymmetry holds outside the lab setup.

    Authors: We agree that our evaluation was performed in a controlled environment using known LLMs and fixed prompts, which enabled precise measurement of the 100% success rate for the iterative refinement attack across the 17 CVEs. This setup serves as a realistic proxy for many ACR pipelines that rely on standard model configurations. The asymmetry claim is grounded in the fundamental difference in access: an attacker can perform unlimited local iterations against a cloned pipeline (using publicly available models and prompts), while a defender processes each incoming PR only once without prior knowledge of the attack. We did not empirically test transferability across mismatched configurations such as model versions, temperature settings, or production filters, which is a valid limitation. However, the attack's iterative nature allows adaptation to approximate target behaviors, and the supply-chain risk remains relevant for pipelines using similar open models. We have revised the manuscript to add an explicit limitations subsection discussing generalizability to production environments and to qualify the asymmetry claim accordingly. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical results from controlled experiments

full rationale

The paper reports empirical findings from a large-scale study across 6 LLMs and controlled experiments on 17 CVEs in 10 real-world projects. The 100% success of the iterative refinement attack is presented as a direct experimental outcome in a fully specified local clone setup, not as a derived prediction from fitted parameters or self-referential definitions. The 'fundamental asymmetry' is a qualitative description of attacker/defender capabilities in that setup rather than a mathematical claim that reduces to inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked that would trigger the enumerated circularity patterns. Self-citations (if present) are not load-bearing for the central empirical claims, which remain falsifiable against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical observations of LLM behavior under framing conditions and attack success rates measured against real CVEs; no free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (1)
  • domain assumption LLMs exhibit framing effects when processing code review tasks
    Invoked as the basis for the exploratory study across framing conditions.

pith-pipeline@v0.9.0 · 5610 in / 1134 out tokens · 56693 ms · 2026-05-15T08:56:01.174658+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

    cs.CR 2026-05 unverdicted novelty 5.0

    Dual-mode benchmarks reveal frontier LLMs have high false positives and low vulnerability coverage in cybersecurity tasks while domain-specialized models reach over 50% per-family detection and 0.904 precision, indica...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Nadia Alshahwan, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, and Eddy Wang. 2024. Assured Offline LLM-Based Software Engineer- ing. InProc. InteNSE. 7–12. doi:10.1145/3643661.3643953

  2. [2]

    Anthropics. 2026. claude-code-action GitHub repository. https://github.com/ant hropics/claude-code-action. Accessed: 2026

  3. [3]

    Anthropics. 2026. Claude Code documentation. https://code.claude.com/docs/e n/overview. Accessed: 2026

  4. [4]

    Fannar Steinn Aðalsteinsson et al. 2025. Rethinking Code Review Workflows with LLM Assistance: An Empirical Study. In2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 488–497. doi:10 .1109/esem64174.2025.00013

  5. [5]

    Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, and Domenico Talia. 2025. Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge.Machine Learning 114 (2025), 249. doi:10.1007/s10994-025-06862-6

  6. [6]

    Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. 2022. Deep Learning Based Vulnerability Detection: Are We There Yet?IEEE Transac- tions on Software Engineering48, 9 (2022), 3280–3296. doi:10.1109/TSE.2021.308 7402

  7. [7]

    Siduo Chen. 2025. Cognitive Biases in Large Language Model based Decision Mak- ing: Insights and Mitigation Strategies.Applied and Computational Engineering 138, 1 (March 2025), 167–174. doi:10.54254/2755-2721/2025.21389

  8. [8]

    Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. 2025. SecAlign: Defending Against Prompt Injection with Preference Optimization. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security(Taipei, Taiwan)(CCS ’25). Association for Computing Machinery, New York, NY, USA, 2833...

  9. [9]

    Wei Chen et al. 2025. From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning. doi:10.48550/arXiv.2409.01658 Pre-print on arXiv

  10. [10]

    Yujia Chen. 2025. AutoReview: An LLM-based Multi-Agent System for Security Issue-Oriented Code Review. InProceedings of the 33rd ACM International Con- ference on the Foundations of Software Engineering (FSE Companion ’25). ACM, 1022–1024. doi:10.1145/3696630.3728618

  11. [11]

    Vanessa Cheung, Maximilian Maier, and Falk Lieder. 2025. Large language models show amplified cognitive biases in moral decision-making.Proceedings of the National Academy of Sciences122, 25 (2025), e2412015122

  12. [12]

    Md Atique Reza Chowdhury et al. 2022. On the Untriviality of Trivial Packages: An Empirical Study of npm JavaScript Packages.IEEE Transactions on Software Engineering48, 8 (Aug. 2022), 2695–2708. doi:10.1109/tse.2021.3068901

  13. [13]

    CodeRabbit. 2025. CodeRabbit: AI Code Reviews. https://www.coderabbit.ai/ Accessed: 2026-02-02

  14. [14]

    In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023

    Roland Croft, M. Ali Babar, and M. Mehdi Kholoosi. 2023. Data Quality for Soft- ware Vulnerability Datasets. InProceedings of the 45th International Conference on Software Engineering(Melbourne, Victoria, Australia)(ICSE ’23). IEEE Press, 121–133. doi:10.1109/ICSE48619.2023.00022

  15. [15]

    Andreas Dann, Henrik Plate, Ben Hermann, Serena Elisa Ponta, and Eric Bodden

  16. [16]

    2022), 3613–3625

    Identifying Challenges for OSS Vulnerability Scanners — A Study & Test Suite.IEEE Transactions on Software Engineering48, 9 (Sept. 2022), 3613–3625. doi:10.1109/tse.2021.3101739

  17. [17]

    Yangruibo Ding et al. 2025. Vulnerability Detection with Code Language Models: How Far Are We?. InProceedings of the IEEE/ACM 47th International Conference on Software Engineering(Ottawa, Ontario, Canada)(ICSE ’25). IEEE Press, 1729–1741. doi:10.1109/ICSE55347.2025.00038

  18. [18]

    Qingxiu Dong et al. 2024. A Survey on In-context Learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1107–1128. doi:10.18653/v1/2024.emnlp-main.64

  19. [19]

    Georgios-Petros Drosos, Thodoris Sotiropoulos, Diomidis Spinellis, and Dimitris Mitropoulos. 2024. Bloat beneath Python’s Scales: A Fine-Grained Inter-Project Dependency Analysis.Proc. ACM Softw. Eng.1, FSE, Article 114 (July 2024), 24 pages. doi:10.1145/3660821

  20. [20]

    Samuele D’Avenia and Valerio Basile. 2025. Quantifying the Influence of Irrelevant Contexts on Political Opinions Produced by LLMs. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop). Association for Computational Linguistics, 434–454. doi:10 .18653/v1/2025.acl-srw.28

  21. [21]

    Daniel E. O’Leary. 2025. An Anchoring Effect in Large Language Models.IEEE Intelligent Systems40, 2 (2025), 23–26. doi:10.1109/MIS.2025.3544939

  22. [22]

    Aaron Fanous et al. 2025. SycEval: Evaluating LLM Sycophancy.Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society8, 1 (Oct. 2025), 893–900. doi:10.1609/aies.v8i1.36598

  23. [23]

    Andres Freund. 2024. Backdoor in XZ Utils. Public disclosure and technical analysis. https://www.openwall.com/lists/oss-security/2024/03/29/4

  24. [24]

    Michael Fu and Chakkrit Tantithamthavorn. 2022. LineVul: a transformer- based line-level vulnerability prediction. InProceedings of the 19th Interna- tional Conference on Mining Software Repositories (MSR ’22). ACM, 608–620. doi:10.1145/3524842.3528452

  25. [25]

    Federico Germani and Giovanni Spitale. 2025. Source framing triggers systematic bias in large language models.Science Advances11, 45 (Nov. 2025). doi:10.1126/ sciadv.adz2924

  26. [26]

    GitHub. 2024. How GitHub Copilot Works. https://docs.github.com/en/copilot/o verview-of-github-copilot/about-github-copilot. Accessed: 2025

  27. [27]

    GitHub. 2026. GitHub Actions. https://github.com/features/actions. Accessed: 2026

  28. [28]

    Greptile. 2026. Greptile: The AI Code Reviewer. https://www.greptile.com/ Accessed: 2026-02-02

  29. [29]

    Anshul Gupta and Neel Sundaresan. 2018. Intelligent code reviews using deep learning. InProceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’18) Deep Learning Day

  30. [30]

    Hazim Hanif and Sergio Maffeis. 2022. VulBERTa: Simplified Source Code Pre- Training for Vulnerability Detection. In2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8. doi:10.1109/ijcnn55064.2022.9892280

  31. [31]

    Jingxuan He and Martin Vechev. 2023. Large Language Models for Code: Security Hardening and Adversarial Testing. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security(Copenhagen, Denmark) (CCS ’23). Association for Computing Machinery, New York, NY, USA, 1865–1879. doi:10.1145/3576915.3623175

  32. [32]

    Joseph Hejderup, Arie van Deursen, and Georgios Gousios. 2018. Software ecosystem call graph for dependency management. InProceedings of the 40th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE ’18). ACM, 101–104. doi:10.1145/3183399.3183417

  33. [33]

    Jellyfish. 2025. 2025 AI Metrics in Review: What 12 Months of Data Tell Us About Adoption and Impact. https://jellyfish.co/blog/2025-ai-metrics-in-review/ Accessed: 2026-02-02

  34. [34]

    Haolin Jin and Huaming Chen. 2025. Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 3819–3823. doi:10.1109/ASE63991.2025.00323

  35. [35]

    Sungwon Kim and Daniel Khashabi. 2025. Challenging the Evaluator: LLM Sycophancy Under User Rebuttal. InFindings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, Suzhou, China, 22461–22478. https://aclanthology.org/2025.findings-emnlp.1222/

  36. [36]

    Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2020. Spectre attacks: exploiting speculative execution. Commun. ACM63, 7 (June 2020), 93–101. doi:10.1145/3399742

  37. [37]

    Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. 2025. LLMxCPG: Context-Aware vulnerability detection through code property Graph- Guided large language models. In34th USENIX Security Symposium (USENIX Security 25). 489–507

  38. [38]

    Lvxue Li et al. 2024. Debiasing In-Context Learning by Instructing LLMs How to Follow Demonstrations. InFindings of the Association for Computational Linguistics ACL 2024. Association for Computational Linguistics, 7203–7215. doi:10.18653/v1/2024.findings-acl.430

  39. [39]

    Zhiyu Li et al. 2022. Automating code review activities by large-scale pre-training. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’22). ACM, 1035–1047. doi:10.1145/3540250.3549081

  40. [40]

    Jie Lin and David Mohaisen. 2025. From Large to Mammoth: A Comparative Evaluation of Large Language Models in Zero-Shot Vulnerability Detection. In Proceedings 2025 Network and Distributed System Security Symposium (NDSS 2025). Internet Society. doi:10.14722/ndss.2025.241491

  41. [41]

    2025.Unveiling the Critical At- tack Path for Implanting Backdoors in Supply Chains: Practical Experience from XZ

    Mario Lins, René Mayrhofer, and Michael Roland. 2025.Unveiling the Critical At- tack Path for Implanting Backdoors in Supply Chains: Practical Experience from XZ. , , D. Mitropoulos, N. Alexopoulos, G. Alexopoulos, and D. Spinellis Springer Nature Singapore, 521–541. doi:10.1007/978-981-95-4434-9_24

  42. [42]

    Yang Liu, Armstrong Foundjem, Foutse Khomh, and Heng Li. 2025. Adversarial Attack Classification and Robustness Testing for Large Language Models for Code. Empirical Software Engineering30, 5 (2025). doi:10.1007/s10664-025-10693-3

  43. [43]

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. 2024. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. In33rd USENIX Security Symposium (USENIX Security 24). USENIX Association, Philadel- phia, PA, 1831–1847. https://www.usenix.org/conference/usenixsecurity24/pre sentation/liu-yupei

  44. [44]

    Jiaxu Lou and Yifan Sun. 2025. Anchoring bias in large language models: an experimental study.Journal of Computational Social Science9, 11 (Dec. 2025). doi:10.1007/s42001-025-00435-2

  45. [45]

    Guilong Lu et al. 2024. GRACE: Empowering LLM-based software vulnerability detection with graph structure and in-context learning.Journal of Systems and Software212 (June 2024), 112031. doi:10.1016/j.jss.2024.112031

  46. [46]

    Junyi Lu et al. 2023. LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning. In2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 647–658. doi:10.1109/issre59848.2023.00026

  47. [47]

    Simon Malberg, Roman Poletukhin, Carolin Schuster, and Georg Groh Groh. 2025. A Comprehensive Evaluation of Cognitive Biases in LLMs. InProceedings of the 5th International Conference on Natural Language Processing for Digital Humanities. Association for Computational Linguistics, 578–613. doi:10.18653/v1/2025.nlp4dh- 1.50

  48. [48]

    Mir, Mehdi Keshani, and Sebastian Proksch

    Amir M. Mir, Mehdi Keshani, and Sebastian Proksch. 2023. On the Effect of Transitivity and Granularity on Vulnerability Propagation in the Maven Ecosys- tem. In2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 201–211. doi:10.1109/SANER56733.2023.00028

  49. [49]

    MITRE CVE Program. 2024. CVE-2024-56143: Strapi Allows Unauthorized Access to Private Fields via parms.lookup. https://cve.mitre.org/cgi-bin/cvename.cgi?na me=CVE-2024-56143. Accessed: 2026-03-25

  50. [50]

    Jiwon Moon, Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, and Kyomin Jung. 2026. Don’t Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation. InFindings of the Association for Computational Linguistics: EACL 2026, Vera Demberg, Kentaro Inui, and Lluís Marquez (Eds.). Association for Computational Linguistics, Rabat, Morocco, ...

  51. [51]

    John Naulty, Eason Chen, Joy Wang, George Digkas, and Kostas Chalkias

  52. [52]

    arXiv:2503.17302 [cs.CR] https://arxiv.org/abs/2503.17302

    Bugdar: AI-Augmented Secure Code Review for GitHub Pull Requests. arXiv:2503.17302 [cs.CR] https://arxiv.org/abs/2503.17302

  53. [53]

    Jeremy K. Nguyen. 2024. Human bias in AI models? Anchoring effects and mitiga- tion strategies in large language models.Journal of Behavioral and Experimental Finance43 (Sept. 2024), 100971. doi:10.1016/j.jbef.2024.100971

  54. [54]

    Georgios Nikitopoulos, Konstantina Dritsa, Panos Louridas, and Dimitris Mitropoulos. 2021. CrossVul: a cross-language vulnerability dataset with commit data. InProceedings of the 29th ACM Joint Meeting on European Software Engi- neering Conference and Symposium on the Foundations of Software Engineering (Athens, Greece)(ESEC/FSE 2021). Association for Com...

  55. [55]

    Yu Nong, Mohammed Aldeen, Long Cheng, Hongxin Hu, Feng Chen, and Haipeng Cai. 2024. Chain-of-Thought Prompting of Large Language Models for Discover- ing and Fixing Software Vulnerabilities. arXiv:2402.17230 [cs.CR]

  56. [56]

    Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining Zero-Shot Vulnerability Repair with Large Lan- guage Models. In2023 IEEE Symposium on Security and Privacy (SP). 2339–2356. doi:10.1109/SP46215.2023.10179324

  57. [57]

    Benji Peng, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Junyu Liu, Xinyuan Song, and Qian Niu. 2025. Securing Large Language Models: Addressing Bias, Misinformation, and Prompt Attacks. arXiv:2409.08087 [cs.CR] https://arxiv.or g/abs/2409.08087

  58. [58]

    Ethan Perez et al. 2023. Discovering Language Model Behaviors with Model- Written Evaluations. InFindings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, 13387–13434. doi:10.18653 /v1/2023.findings-acl.847

  59. [59]

    Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. 2023. Do Users Write More Insecure Code with AI Assistants?. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23). ACM, 2785–2799. doi:10.1145/3576915.3623157

  60. [60]

    Chanathip Pornprasit and Chakkrit Tantithamthavorn. 2024. Fine-tuning and prompt engineering for large language models-based code review automation. Information and Software Technology175 (Nov. 2024), 107523. doi:10.1016/j.infs of.2024.107523

  61. [61]

    Piotr Przymus, Andreas Happe, and Jürgen Cito. 2025. Adversarial Bug Re- ports as a Security Risk in Language Model-Based Automated Program Repair. arXiv:2509.05372 [cs.SE] https://arxiv.org/abs/2509.05372

  62. [62]

    PullFlow. 2025. State of AI Code Review 2025. https://pullflow.com/state-of-ai- code-review-2025. Accessed: 2026-02-02

  63. [63]

    Niklas Risse and Marcel Böhme. 2024. Uncovering the limits of machine learn- ing for automatic vulnerability detection. InProceedings of the 33rd USENIX Conference on Security Symposium(Philadelphia, PA, USA)(SEC ’24). USENIX Association, USA, Article 238, 18 pages

  64. [64]

    Mrinank Sharma et al. 2024. Towards Understanding Sycophancy in Language Models. InInternational Conference on Learning Representations, B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun (Eds.), Vol. 2024. 110–144

  65. [65]

    Ze Sheng et al . 2025. LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights.Comput. Surveys58, 5 (Nov. 2025), 1–35. doi:10.1145/3769082

  66. [66]

    Jing Kai Siow et al . 2020. CORE: Automating Review Recommendation for Code Changes. In2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 284–295. doi:10.1109/saner48275.202 0.9054794

  67. [67]

    STELLAR: A search- based testing framework for large language model applications

    Lev Sorokin, Ivan Vasilev, Ken E. Friedl, and Andrea Stocco. 2026. STELLAR: A Search-Based Testing Framework for Large Language Model Applications. In SANER ’26. arXiv:2601.00497

  68. [68]

    Diomidis Spinellis. 2010. CScout: A Refactoring Browser for C.Science of Computer Programming75, 4 (April 2010), 216–231. doi:10.1016/j.scico.2009.09. 003

  69. [69]

    Joseph Spracklen et al. 2025. We have a package for you a comprehensive analysis of package hallucinations by code generating LLMs. InProceedings of the 34th USENIX Conference on Security Symposium(Seattle, WA, USA)(SEC ’25). USENIX Association, USA, Article 190, 20 pages

  70. [70]

    Stack Overflow. 2025. 2025 Developer Survey. https://survey.stackoverflow.co/2 025 Accessed: 2026-02-02

  71. [71]

    Benjamin Steenhoek, Md Mahbubur Rahman, Richard Jiles, and Wei Le. 2023. An Empirical Study of Deep Learning Models for Vulnerability Detection. InPro- ceedings of the 45th International Conference on Software Engineering(Melbourne, Victoria, Australia)(ICSE ’23). IEEE Press, 2237–2248. doi:10.1109/ICSE48619.20 23.00188

  72. [72]

    Strapi Solutions. 2026. Strapi. https://github.com/strapi/strapi

  73. [73]

    Tao Sun et al. 2025. BitsAI-CR: Automated Code Review via LLM in Practice. InProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering (FSE Companion ’25). ACM, 274–285. doi:10.1145/3696630. 3728552

  74. [74]

    SWE-bench. 2026. SWE-bench Official Leaderboards. https://www.swebench.c om/. Accessed: 2026

  75. [75]

    Amos Tversky and Daniel Kahneman. 1981. The framing of decisions and the psychology of choice.science211, 4481 (1981), 453–458

  76. [76]

    Miriam Ugarte, Pablo Valle, Jose Antonio Parejo, Sergio Segura, and Aitor Arrieta

  77. [77]

    ASTRAL: Automated Safety Testing of Large Language Models. InProc. AST. 114–124. doi:10.1109/AST66626.2025.00018

  78. [78]

    Miku Watanabe et al. 2024. On the Use of ChatGPT for Code Review: Do De- velopers Like Reviews By ChatGPT?. InProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024). ACM, 375–380. doi:10.1145/3661167.3661183

  79. [79]

    Ratnadira Widyasari et al. 2026. Let the Trial Begin: A Mock-Court Approach to Vulnerability Detection using LLM-Based Agents. InProc. ICSE. To appear

  80. [80]

    Qiushi Wu and Kangjie Lu. 2021. On the Feasibility of Stealthily Introducing Vulnerabilities in Open-Source Software via Hypocrite Commits.University of Minnesota(2021)

Showing first 80 references.