Vulnerability detection with code language models: How far are we?

2024 · arXiv 2403.18624

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage

cs.CR · 2026-05-20 · unverdicted · novelty 7.0

A new queryable binary dataset combining cross-build diversity, temporal history, and CVE labels with linked metadata for vulnerability research.

BioDefect: The First Dataset for Defect Detection in Bioinformatics Software

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.

RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code

cs.CR · 2026-04-15 · unverdicted · novelty 7.0

RealVuln benchmark finds security-specialized scanners outperform general-purpose LLMs and rule-based SAST tools on hand-labeled vulnerable Python code under F3 scoring, with all artifacts released.

Focus on What Matters: Fisher-Guided Adaptive Multimodal Fusion for Vulnerability Detection

cs.SE · 2026-01-05 · unverdicted · novelty 7.0

Fisher information selects task-relevant parts of graph features to fuse with pretrained code models, improving vulnerability detection F1 by up to 6.3 points on BigVul, Devign, and ReVeal.

ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?

cs.SE · 2025-09-11 · unverdicted · novelty 7.0

ReDef creates a revert-anchored dataset of 3,164 defective and 10,268 clean code modifications and shows that code language models perform better with diff encodings but maintain stable performance under counterfactual perturbations, indicating reliance on superficial cues.

Direction for Detection: A Survey of Automated Vulnerability Detection and all of its Pain Points

cs.SE · 2024-12-15 · conditional · novelty 7.0

ML4AVD research remains locked into binary function-level classification of C/C++ vulnerabilities because twelve pain points in the pipeline reinforce each other through feedback loops.

Veritas: A Semantically Grounded Agentic Framework for Memory Corruption Vulnerability Detection in Binaries

cs.SE · 2026-05-14 · unverdicted · novelty 6.0

Veritas detects memory corruption vulnerabilities in stripped binaries by combining static value-flow slicing, dual-view LLM reasoning, and multi-agent runtime validation, reporting 90% recall, zero false positives on 623 exhaustive cases, and discovery of a real Apple CVE.

Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study

cs.SE · 2026-05-13 · accept · novelty 6.0

Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.

Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis

cs.SE · 2026-04-12 · unverdicted · novelty 6.0

A framework combining universal AST normalization, hybrid graph-LLM embeddings, and strict execution-grounded validation achieves 89-92% intra-language accuracy and 74-80% cross-language F1 while resolving 70% of vulnerabilities at 12% failure rate.

Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs

cs.SE · 2026-04-09 · unverdicted · novelty 6.0

Adding interprocedural context from callers or callees enables LLMs to detect vulnerabilities more effectively, with Gemini 3 Flash achieving F1 scores of at least 0.978 for C at low cost and Claude Haiku 4.5 excelling at explanations.

PoC-Adapt: Semantic-Aware Automated Vulnerability Reproduction with LLM Multi-Agents and Reinforcement Learning-Driven Adaptive Policy

cs.CR · 2026-04-08 · unverdicted · novelty 6.0

PoC-Adapt improves automated PoC exploit generation reliability by 25% and lowers cost using semantic state validation and RL adaptive policies, verifying 12 PoCs from 80 recent CVE attempts at $0.42 each.

Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap

cs.CR · 2026-01-30 · unverdicted · novelty 6.0

Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection, achieving high scores on unpaired normal code but failing on paired vulnerable-patched code, semantic perturbations, and gap analysis, while reasoning supervision reduces symptoms at the cost of recall.

A Metamorphic Testing Perspective on Knowledge Distillation for Language Models of Code: Does the Student Deeply Mimic the Teacher?

cs.SE · 2025-11-07 · unverdicted · novelty 6.0

Student models distilled from code language models often fail to deeply mimic teachers, showing up to 62% behavioral discrepancies and 285% worse drops under attacks that accuracy metrics miss.

XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants

cs.CR · 2025-03-18 · unverdicted · novelty 6.0

XOXO is a cross-origin context poisoning attack on AI coding assistants that uses a Cayley Graph search algorithm (GCGS) to find stealthy perturbations, achieving 75.72% average success rate across five tasks and eleven models.

LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution

cs.CR · 2026-05-07 · unverdicted · novelty 5.0

LCC-LLM creates a code-centric dataset and RAG-based LLM framework that reaches 0.634 average semantic similarity on 43 malware tasks and 10/10 pass rate in real-world case studies.

Learning Generalizable Multimodal Representations for Software Vulnerability Detection

cs.SE · 2026-04-28 · unverdicted · novelty 5.0

MultiVul uses multimodal contrastive learning to align code and comment representations, yielding up to 27% F1 gains on vulnerability detection benchmarks over prompting and code-only baselines.

VulWeaver: Weaving Broken Semantics for Grounded Vulnerability Detection

cs.SE · 2026-04-12 · unverdicted · novelty 5.0

VulWeaver improves Java vulnerability detection to 0.75 F1 by enhancing dependency graphs with LLM semantic fixes, extracting full context from slices plus implicit usage info, and applying type-specific meta-prompting with majority voting.

VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection

cs.AI · 2026-05-10 · conditional · novelty 4.0 · 2 refs

VulTriage combines control dependency extraction, CWE knowledge retrieval, and semantic summarization to improve LLM accuracy on vulnerability detection, reaching SOTA on PrimeVul and generalizing to Kotlin.
