A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.
hub
arXiv preprint arXiv:2403.18624 , year=
18 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
RealVuln benchmark finds security-specialized scanners outperform general-purpose LLMs and rule-based SAST tools on hand-labeled vulnerable Python code under F3 scoring, with all artifacts released.
Fisher information selects task-relevant parts of graph features to fuse with pretrained code models, improving vulnerability detection F1 by up to 6.3 points on BigVul, Devign, and ReVeal.
ReDef creates a revert-anchored dataset of 3,164 defective and 10,268 clean code modifications and shows that code language models perform better with diff encodings but maintain stable performance under counterfactual perturbations, indicating reliance on superficial cues.
ML4AVD research remains locked into binary function-level classification of C/C++ vulnerabilities because twelve pain points in the pipeline reinforce each other through feedback loops.
Veritas detects memory corruption vulnerabilities in stripped binaries by combining static value-flow slicing, dual-view LLM reasoning, and multi-agent runtime validation, reporting 90% recall, zero false positives on 623 exhaustive cases, and discovery of a real Apple CVE.
Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.
A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vulnerabilities at 12% failure rate.
Adding interprocedural context from callers or callees enables LLMs to detect vulnerabilities more effectively, with Gemini 3 Flash achieving F1 scores of at least 0.978 for C at low cost and Claude Haiku 4.5 excelling at explanations.
PoC-Adapt improves automated PoC exploit generation reliability by 25% and lowers cost using semantic state validation and RL adaptive policies, verifying 12 PoCs from 80 recent CVE attempts at $0.42 each.
Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection, achieving high scores on unpaired normal code but failing on paired vulnerable-patched code, semantic perturbations, and gap analysis, while reasoning supervision reduces symptoms at the cost of recall.
Student models distilled from code language models often fail to deeply mimic teachers, showing up to 62% behavioral discrepancies and 285% worse drops under attacks that accuracy metrics miss.
XOXO is a cross-origin context poisoning attack on AI coding assistants that uses a Cayley Graph search algorithm (GCGS) to find stealthy perturbations, achieving 75.72% average success rate across five tasks and eleven models.
LCC-LLM creates a code-centric dataset and RAG-based LLM framework that reaches 0.634 average semantic similarity on 43 malware tasks and 10/10 pass rate in real-world case studies.
MultiVul uses multimodal contrastive learning to align code and comment representations, yielding up to 27% F1 gains on vulnerability detection benchmarks over prompting and code-only baselines.
VulWeaver improves Java vulnerability detection to 0.75 F1 by enhancing dependency graphs with LLM semantic fixes, extracting full context from slices plus implicit usage info, and applying type-specific meta-prompting with majority voting.
VulTriage combines control dependency extraction, CWE knowledge retrieval, and semantic summarization to improve LLM accuracy on vulnerability detection, reaching SOTA on PrimeVul and generalizing to Kotlin.
citing papers explorer
-
ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage
A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.
-
BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
-
RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code
RealVuln benchmark finds security-specialized scanners outperform general-purpose LLMs and rule-based SAST tools on hand-labeled vulnerable Python code under F3 scoring, with all artifacts released.
-
Focus on What Matters: Fisher-Guided Adaptive Multimodal Fusion for Vulnerability Detection
Fisher information selects task-relevant parts of graph features to fuse with pretrained code models, improving vulnerability detection F1 by up to 6.3 points on BigVul, Devign, and ReVeal.
-
ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?
ReDef creates a revert-anchored dataset of 3,164 defective and 10,268 clean code modifications and shows that code language models perform better with diff encodings but maintain stable performance under counterfactual perturbations, indicating reliance on superficial cues.
-
Direction for Detection: A Survey of Automated Vulnerability Detection and all of its Pain Points
ML4AVD research remains locked into binary function-level classification of C/C++ vulnerabilities because twelve pain points in the pipeline reinforce each other through feedback loops.
-
Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries
Veritas detects memory corruption vulnerabilities in stripped binaries by combining static value-flow slicing, dual-view LLM reasoning, and multi-agent runtime validation, reporting 90% recall, zero false positives on 623 exhaustive cases, and discovery of a real Apple CVE.
-
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.
-
Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis
A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vulnerabilities at 12% failure rate.
-
Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs
Adding interprocedural context from callers or callees enables LLMs to detect vulnerabilities more effectively, with Gemini 3 Flash achieving F1 scores of at least 0.978 for C at low cost and Claude Haiku 4.5 excelling at explanations.
-
PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy
PoC-Adapt improves automated PoC exploit generation reliability by 25% and lowers cost using semantic state validation and RL adaptive policies, verifying 12 PoCs from 80 recent CVE attempts at $0.42 each.
-
Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap
Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection, achieving high scores on unpaired normal code but failing on paired vulnerable-patched code, semantic perturbations, and gap analysis, while reasoning supervision reduces symptoms at the cost of recall.
-
A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?
Student models distilled from code language models often fail to deeply mimic teachers, showing up to 62% behavioral discrepancies and 285% worse drops under attacks that accuracy metrics miss.
-
XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants
XOXO is a cross-origin context poisoning attack on AI coding assistants that uses a Cayley Graph search algorithm (GCGS) to find stealthy perturbations, achieving 75.72% average success rate across five tasks and eleven models.
-
LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution
LCC-LLM creates a code-centric dataset and RAG-based LLM framework that reaches 0.634 average semantic similarity on 43 malware tasks and 10/10 pass rate in real-world case studies.
-
Learning Generalizable Multimodal Representations for Software Vulnerability Detection
MultiVul uses multimodal contrastive learning to align code and comment representations, yielding up to 27% F1 gains on vulnerability detection benchmarks over prompting and code-only baselines.
-
VulWeaver: Weaving Broken Semantics for Grounded Vulnerability Detection
VulWeaver improves Java vulnerability detection to 0.75 F1 by enhancing dependency graphs with LLM semantic fixes, extracting full context from slices plus implicit usage info, and applying type-specific meta-prompting with majority voting.
-
VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection
VulTriage combines control dependency extraction, CWE knowledge retrieval, and semantic summarization to improve LLM accuracy on vulnerability detection, reaching SOTA on PrimeVul and generalizing to Kotlin.