Canonical reference
Large language model for vulnerability detection: Emerging results and future directions
Canonical reference. 100% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
roles
background 4
polarities
background 4
citing papers explorer
- An Empirical Study of Security Calibration in Large Language Models for Code
  Empirical evaluation of three LLMs finds prevalent overconfidence in insecure code generation, with security calibration outperforming functional calibration but both degrading in repository-level settings.
- What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants
  An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.
- Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review
  LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.
- A Methodological Analysis of Empirical Studies in Quantum Software Testing
  A systematic analysis of 59 quantum software testing empirical studies reveals highly diverse designs, inconsistent reporting, and open methodological challenges, leading to recommendations for future work.
- Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
  A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
- Direction for Detection: A Survey of Automated Vulnerability Detection and all of its Pain Points
  ML4AVD research remains locked into binary function-level classification of C/C++ vulnerabilities because twelve pain points in the pipeline reinforce each other through feedback loops.
- Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs
  Adding interprocedural context from callers or callees enables LLMs to detect vulnerabilities more effectively, with Gemini 3 Flash achieving F1 scores of at least 0.978 for C at low cost and Claude Haiku 4.5 excelling at explanations.
- Who's Who? LLM-assisted Software Traceability with Architecture Entity Recognition
  LLM approaches ExArch and ArTEMiS reach F1 scores of 0.86 and 0.81 for architecture entity recognition and traceability, matching or approaching baselines that require manual models.
- A Study of LLMs' Preferences for Libraries and Programming Languages
  Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
- Evaluating LLMs for Real-World Web Vulnerability Detection
  Frontier LLMs detect up to 63% of web vulnerabilities in WordPress plugins with scoped prompts outperforming open-ended ones, but all show low consistency across runs and miss some baseline issues.
- "Like Taking the Path of Least Resistance": Exploring the Impact of LLM Interaction on the Creative Process of Programming
  LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.
- Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models
  bLLMs achieve state-of-the-art results on limited and imbalanced SE sentiment datasets even in zero-shot settings, but fine-tuned sLLMs outperform when ample balanced training data is available.
- Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap
  A research roadmap analyzing the current state of search-based software engineering with foundation models, outlining challenges and directions across three integration aspects.
- Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models
  Reproducibility study of Vul-RAG confirms original findings in a fully local open-weights setting but identifies a persistent performance plateau at approximately 0.30 pairwise accuracy across diverse recent open-weight LLMs.
- Nix: A Solution With Problems
  A literature review of Nix's functional package management solutions to software deployment problems alongside the new and unsolved issues it introduces.