LLMs cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks
4 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- An Empirical Study of Security Calibration in Large Language Models for Code
  Empirical evaluation of three LLMs finds prevalent overconfidence in insecure code generation, with security calibration outperforming functional calibration but both degrading in repository-level settings.
- ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage
  A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.
- LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software
  Creates LogicDS with 122 logical vulnerabilities and LogicEval framework to evaluate repair techniques, finding failures mainly from prompt sensitivity, lost code context, and poor patch localization.
- XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants
  XOXO is a cross-origin context poisoning attack on AI coding assistants that uses a Cayley Graph search algorithm (GCGS) to find stealthy perturbations, achieving 75.72% average success rate across five tasks and eleven models.