LLM-assisted static analysis for detecting security vulnerabilities
Canonical reference. 100% of citing Pith papers cite this work as background.
Citation roles: background (5)
Citation polarities: background (5)
Citing papers
- Do Coding Agents Understand Least-Privilege Authorization?
  Coding agents struggle to infer least-privilege file permissions: they omit needed accesses while granting unused or sensitive ones. Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.
- Generating Complex Code Analyzers from Natural Language Questions
  Merlin generates CodeQL queries from natural-language questions via RAG-based iteration and a self-test technique using assistive queries. In user studies it achieves 3.8x higher task accuracy and 31% shorter completion time, and it finds additional software issues.
- Longitudinal Analyses of SAST Tools: A CodeQL Case Study
  CodeQL detected 171 CVEs in total, 83 of which an earlier CodeQL version caught before the fix; detections were often actionable within the vulnerable file but not stable across tool versions.
- Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery
  Refute-or-Promote applies adversarial multi-agent review with kill gates and empirical verification to filter LLM-generated defect candidates, killing 79-83% of them before disclosure and yielding 4 CVEs plus multiple accepted fixes across libraries, the C++ standard, and compilers.
- CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings
  CodeCureAgent, an agentic LLM framework, achieves 96.8% plausible fixes and 86.3% correct fixes on 1,000 SonarQube warnings across 106 Java projects.
- FuzzingBrain V2: A Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction
  FuzzingBrain V2, a multi-agent LLM system with a novel Suspicious Point abstraction and dual-layer fuzzing, reports a 90% detection rate on a C/C++ benchmark and 29 confirmed zero-day vulnerabilities in real open-source projects.
- Three Heads Are Better Than One: A Multi-perspective Reasoning Framework for Enhanced Vulnerability Detection
  ReasonVul deploys three LLM agents that analyze code independently and then engage in structured debate, achieving 40% PairAcc and 72.52% F1 on PrimeVul and outperforming baselines by 81% in PairAcc.
- Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries
  Veritas detects memory corruption vulnerabilities in stripped binaries by combining static value-flow slicing, dual-view LLM reasoning, and multi-agent runtime validation, reporting 90% recall, zero false positives on 623 exhaustive cases, and the discovery of a real Apple CVE.
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
  Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on expert-labeled samples.
- AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection
  AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.
- Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap
  Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection: they score highly on unpaired normal code but fail on paired vulnerable-patched code, semantic perturbations, and gap analysis. Reasoning supervision reduces these symptoms at the cost of recall.
- Learning Project-wise Subsequent Code Edits via Interleaving Neural-based Induction and Tool-based Deduction
  TRACE improves project-wise subsequent code editing by interleaving neural-based induction for semantic edits with tool-based deduction for syntactic edits.
- VulWeaver: Weaving Broken Semantics for Grounded Vulnerability Detection
  VulWeaver improves Java vulnerability detection to 0.75 F1 by repairing dependency graphs with LLM-inferred semantics, extracting full context from slices plus implicit usage information, and applying type-specific meta-prompting with majority voting.
- Evaluating the Reliability of Multiple Large Language Models in Risk Assessment: A CIS Controls Based Approach
  Large language models consistently underestimate cybersecurity risks compared to human experts in CIS Controls-based assessments, indicating that they should serve as complementary rather than standalone tools.
- CyberAId: AI-Driven Cybersecurity for Financial Service Providers
  CyberAId is a proposed on-premise multi-agent system that coordinates LLM subagents with classical security tools to improve threat response and regulatory alignment in financial services.
- Adaptive and AI-Augmented Security Testing: A Systematic Survey of Program Analysis, Feedback-Driven Testing, and Hybrid Learning-Based Approaches
  A systematic survey of 55 studies on security testing that identifies a structural-adaptive fragmentation between program representations and adaptive mechanisms and proposes a unified research agenda.
- A Blueprint for AI-Driven Software Quality: Integrating LLMs with Established Standards
  A survey mapping LLM applications in software quality assurance to established standards, including ISO/IEC 12207, ISO 25010, CMMI, and TMM, with case studies, challenges, and future directions.
- Finding Memory Leaks in C/C++ Programs via Neuro-Symbolic Augmented Static Analysis
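The least-privilege entry above separates two failure modes: omitting accesses a task needs, and granting accesses it never uses. The Python sketch below is a minimal illustration of that split, not the paper's Sufficiency-Tightness Decomposition. It assumes permissions are modeled as hypothetical `(path, mode)` pairs and that the set of needed accesses is already known; the names `check_policy`, `PolicyReport`, and `SENSITIVE_PREFIXES` are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical permission model: (path, mode), e.g. ("src/app.py", "write").
Permission = tuple[str, str]

# Hypothetical path prefixes treated as sensitive (an assumption, not from the source).
SENSITIVE_PREFIXES = (".env", ".ssh/", "secrets/")


@dataclass(frozen=True)
class PolicyReport:
    missing: frozenset[Permission]  # needed but not granted: sufficiency violations
    excess: frozenset[Permission]   # granted but never needed: tightness violations

    @property
    def sufficient(self) -> bool:
        # Sufficient when every needed access is granted.
        return not self.missing

    @property
    def tight(self) -> bool:
        # Tight when nothing beyond the needed accesses is granted.
        return not self.excess

    def sensitive_excess(self) -> frozenset[Permission]:
        # Over-grants that touch sensitive paths are the riskiest subset of `excess`.
        return frozenset(p for p in self.excess if p[0].startswith(SENSITIVE_PREFIXES))


def check_policy(granted: set[Permission], needed: set[Permission]) -> PolicyReport:
    """Check sufficiency (needed is a subset of granted) and tightness
    (granted is a subset of needed) as two independent set differences."""
    return PolicyReport(
        missing=frozenset(needed - granted),
        excess=frozenset(granted - needed),
    )
```

Keeping the two checks separate lets a policy be scored as sufficient but loose, or tight but insufficient. For example, granting `(".env", "read")` alongside `("src/app.py", "write")` for a task that only edits `src/app.py` passes the sufficiency check, while `sensitive_excess()` flags the `.env` over-grant.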
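Several entries above (ReasonVul and the Semantic Trap study) evaluate models on paired vulnerable/patched code rather than on isolated functions. The Python sketch below computes a pair-level accuracy under one assumed rule: a pair counts as correct only when the vulnerable version is flagged and the patched version is not. This may differ from the exact PairAcc definition those papers use, and the function name `pair_accuracy` is hypothetical.

```python
from collections.abc import Iterable


def pair_accuracy(verdicts: Iterable[tuple[bool, bool]]) -> float:
    """Pair-level accuracy over (vulnerable, patched) function pairs.

    Each element is the model's verdict on one pair:
    (flagged_vulnerable_version, flagged_patched_version).
    Assumed rule: a pair is correct only if the vulnerable version is
    flagged AND the patched version is not.
    """
    pairs = list(verdicts)
    if not pairs:
        raise ValueError("pair_accuracy needs at least one pair")
    correct = sum(1 for vuln_flag, patched_flag in pairs if vuln_flag and not patched_flag)
    return correct / len(pairs)
```

Under this rule, a model that labels every function vulnerable flags all vulnerable versions yet scores 0.0, because it also flags every patched version. This is one way a model can look strong on unpaired data while failing on paired code.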