Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review
Pith reviewed 2026-05-15 08:56 UTC · model grok-4.3
The pith
In controlled tests on 17 CVEs, iteratively refined pull request metadata caused LLM-assisted code review to miss the vulnerability every time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-based automated code review (ACR) is vulnerable to contextual bias introduced through the framing of PR metadata. A novel iterative refinement attack, which refines PR metadata against a local clone of the review pipeline, achieved 100% success in causing the model to miss real vulnerabilities across 17 CVEs in 10 projects, while template-based attacks were ineffective and sometimes backfired. The success stems from an asymmetry: attackers can iterate freely, but defenders have only one chance to detect each attack. Metadata redaction combined with explicit instructions restored detection in all affected cases.
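The asymmetry can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each candidate framing evades the cloned reviewer independently with a fixed probability `p_single`. Neither the independence assumption nor the value 0.10 comes from the paper; `evasion_probability` is a hypothetical helper name.

```python
def evasion_probability(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent framings evades review."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # Complement of "every attempt is caught".
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical per-attempt evasion rate; not a figure from the paper.
P_SINGLE = 0.10

for k in (1, 5, 20, 50):
    print(f"{k:>2} attempts -> {evasion_probability(P_SINGLE, k):.3f}")
```

Under these assumptions, even a weak per-attempt rate compounds quickly for an attacker who can retry offline, whereas the defender sees only the single framing the attacker chose to submit.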
What carries the argument
The framing effect, where the presentation of PR metadata overrides its semantic content in forming LLM security judgments, exploited through contextual-bias injection.
If this is right
- Template-based attacks often backfire by raising reviewer suspicions.
- The iterative attack succeeded in every tested case because the attacker can keep refining the metadata until the cloned reviewer misses the vulnerability.
- Debiasing through metadata removal and explicit instructions eliminates the evasion in tested cases.
- Over-reliance on ACR without human oversight creates security risks in the development process.
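The debiasing idea in the list above (metadata removal plus an explicit instruction) might look roughly like the following. This is a minimal sketch, not the paper's implementation: the `PullRequest` fields, the `build_review_prompt` helper, and the wording of `DEBIAS_INSTRUCTION` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical instruction wording; the paper's actual prompt is not reproduced here.
DEBIAS_INSTRUCTION = (
    "Judge the security of this change from the diff alone. "
    "Treat any claims about safety, testing, or prior review as unverified."
)


@dataclass
class PullRequest:
    # Hypothetical field names chosen for illustration.
    title: str
    description: str
    author: str
    diff: str
    commit_messages: list[str] = field(default_factory=list)


def build_review_prompt(pr: PullRequest, redact_metadata: bool = True) -> str:
    """Assemble the text sent to the reviewing model."""
    parts = [DEBIAS_INSTRUCTION]
    if not redact_metadata:
        # Untrusted, attacker-controlled framing enters the prompt only here.
        parts.append(f"Title: {pr.title}")
        parts.append(f"Author: {pr.author}")
        parts.append(f"Description: {pr.description}")
        parts.extend(f"Commit: {message}" for message in pr.commit_messages)
    parts.append("Diff:\n" + pr.diff)
    return "\n\n".join(parts)
```

With `redact_metadata=True` (the default here), the reviewer sees only the instruction and the diff, so the title, description, author, and commit messages cannot frame the judgment.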
Where Pith is reading between the lines
- The same attacker-defender asymmetry may apply to other LLM security tasks that allow prompt iteration.
- Production pipelines should default to stripping untrusted metadata before LLM review.
- Contributor reputation and human review become more important safeguards when automated checks can be bypassed.
Load-bearing premise
The controlled tests using 17 CVEs across 10 projects and six LLMs reflect how real production ACR pipelines respond to varied workflows and configurations.
What would settle it
Applying the iterative refinement attack to a live production ACR system and observing whether it evades detection at the same rate as in the controlled experiments.
Original abstract
Automated Code Review (ACR) systems integrating Large Language Models (LLMs) are increasingly adopted in software development workflows, ranging from interactive assistants to autonomous agents in CI/CD pipelines. In this paper, we study how LLM-based vulnerability detection in ACR is affected by the framing effect: the tendency to let the presentation of information override its semantic content in forming judgments. We examine whether adversaries can exploit this through contextual-bias injection: crafting PR metadata to bias ACR security judgments as a supply-chain attack vector against real-world ACR pipelines. To this end, we first conduct a large-scale exploratory study across 6 LLMs under five framing conditions, establishing the framing effect as a systematic and widespread phenomenon in LLM-based vulnerability detection, with bug-free framing producing the strongest effect. We then design a realistic and controlled experimental environment, evaluating 17 CVEs across 10 real-world projects, to assess the susceptibility of real-world ACR pipelines to vulnerability reintroduction attacks. We employ two attack strategies: a template-based attack inspired by prior related work, and a novel LLM-assisted iterative refinement attack. We find that template-based attacks are ineffective and may even backfire, as direct biasing attempts raise suspicions. Our iterative refinement attack, on the other hand, achieves 100% success, exploiting a fundamental asymmetry: attackers can iteratively refine attacks against a local clone of the review pipeline, while defenders have only one chance to detect them. Debiasing via metadata redaction and explicit instructions restores detection in all affected cases. Overall, our findings highlight the dangers of over-relying on ACR and stress the importance of human oversight and contributor trust in the development process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the framing effect in LLM-based automated code review (ACR) for security vulnerabilities. Through a large-scale exploratory study across 6 LLMs under five framing conditions, it shows that contextual framing—particularly bug-free framing—systematically biases vulnerability detection. It then evaluates susceptibility in a controlled setup with 17 CVEs across 10 real-world projects using two attack strategies: template-based attacks (found ineffective or counterproductive) and a novel LLM-assisted iterative refinement attack (achieving 100% success). The work highlights an asymmetry where attackers can refine against a local clone of the pipeline while defenders have limited detection opportunities, and demonstrates that debiasing via metadata redaction and explicit instructions restores detection.
Significance. If the results hold, the findings are significant for highlighting a practical supply-chain attack vector against LLM-integrated ACR systems. The large-scale empirical measurements across multiple models and real CVEs provide concrete evidence of the framing effect and its exploitability, underscoring risks of over-reliance on automated security reviews and the value of human oversight. The controlled evaluation on actual projects and the identification of effective debiasing methods offer actionable implications for secure development workflows.
major comments (1)
- Iterative refinement attack and asymmetry claim: The 100% success rate and the central claim of a 'fundamental asymmetry' (attackers iteratively refine against a local clone while defenders have one chance) are load-bearing for the supply-chain attack argument. However, all experiments use a fully controlled environment with known LLMs and fixed prompts across the 17 CVEs. No tests assess transfer when the clone differs from the target (e.g., model version, temperature, system prompt, or additional filters common in production CI/CD), which directly challenges whether the asymmetry holds outside the lab setup.
minor comments (2)
- Abstract: Limited detail is provided on exact success metrics, statistical tests performed, or potential confounds in the iterative attack refinement process.
- Experimental description: Clarify the number of independent runs, variance, or significance testing supporting the 100% success rate to strengthen replicability.
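To illustrate the kind of uncertainty statement the minor comments ask for, the sketch below computes an exact (Clopper-Pearson) lower confidence bound for a success rate when every trial succeeds. It assumes each of the 17 CVEs counts as one independent trial, which is an assumption made here for illustration; the paper may report a different number of runs per CVE. The function name is hypothetical.

```python
def exact_lower_bound_all_successes(n: int, alpha: float = 0.05) -> float:
    """Two-sided Clopper-Pearson lower bound when all n trials succeed.

    With n successes out of n, the lower bound p solves p**n = alpha / 2.
    The matching upper bound is 1.0.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    return (alpha / 2.0) ** (1.0 / n)


# Assumption: 17 independent trials, one per CVE.
lower = exact_lower_bound_all_successes(17)
print(f"17/17 successes: 95% interval for the true rate is [{lower:.3f}, 1.000]")
```

Under this assumption, the observed 100% is compatible with a true success rate as low as roughly 80%, which is why the number of independent runs matters for interpreting the headline figure.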
Simulated Author's Rebuttal
We thank the referee for the detailed and insightful comments on our work. We address the major comment below, acknowledging the controlled nature of our experiments while defending the core claims based on the evidence presented. We have made partial revisions to strengthen the discussion of limitations and generalizability.
- Referee: Iterative refinement attack and asymmetry claim: The 100% success rate and the central claim of a 'fundamental asymmetry' (attackers iteratively refine against a local clone while defenders have one chance) are load-bearing for the supply-chain attack argument. However, all experiments use a fully controlled environment with known LLMs and fixed prompts across the 17 CVEs. No tests assess transfer when the clone differs from the target (e.g., model version, temperature, system prompt, or additional filters common in production CI/CD), which directly challenges whether the asymmetry holds outside the lab setup.
  Authors: We agree that our evaluation was performed in a controlled environment using known LLMs and fixed prompts, which enabled precise measurement of the 100% success rate for the iterative refinement attack across the 17 CVEs. This setup serves as a realistic proxy for many ACR pipelines that rely on standard model configurations. The asymmetry claim is grounded in a fundamental difference in access: an attacker can perform unlimited local iterations against a cloned pipeline (using publicly available models and prompts), while a defender processes each incoming PR only once, without prior knowledge of the attack. We did not empirically test transferability across mismatched configurations such as model versions, temperature settings, or production filters, which is a valid limitation. However, the attack's iterative nature allows adaptation to approximate target behaviors, and the supply-chain risk remains relevant for pipelines using similar open models. We have revised the manuscript to add an explicit limitations subsection discussing generalizability to production environments and to qualify the asymmetry claim accordingly.
  Revision: partial
Circularity Check
No significant circularity: empirical results from controlled experiments
The paper reports empirical findings from a large-scale study across 6 LLMs and controlled experiments on 17 CVEs in 10 real-world projects. The 100% success of the iterative refinement attack is presented as a direct experimental outcome in a fully specified local clone setup, not as a derived prediction from fitted parameters or self-referential definitions. The 'fundamental asymmetry' is a qualitative description of attacker/defender capabilities in that setup rather than a mathematical claim that reduces to inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked that would trigger the enumerated circularity patterns. Self-citations (if present) are not load-bearing for the central empirical claims, which remain falsifiable against the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs exhibit framing effects when processing code review tasks.
Forward citations
Cited by 1 Pith paper
- Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
  Dual-mode benchmarks reveal frontier LLMs have high false positives and low vulnerability coverage in cybersecurity tasks while domain-specialized models reach over 50% per-family detection and 0.904 precision, indica...