
arxiv: 2605.04260 · v1 · submitted 2026-05-05 · 💻 cs.CR · cs.SE

Lightweight Vulnerability Detection from Code Metrics and Token Features

Pith reviewed 2026-05-08 17:21 UTC · model grok-4.3

classification 💻 cs.CR cs.SE
keywords vulnerability detection, C/C++ code, token n-grams, code metrics, TF-IDF, logistic regression, Devign dataset, cross-project evaluation

The pith

Sparse token n-grams from raw function text plus a few code metrics let logistic regression rank vulnerable C/C++ functions at useful levels on random splits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a lightweight triage method that extracts TF-IDF weighted token n-grams directly from function source code and augments them with five simple metrics such as lines of code and cyclomatic complexity. It trains a class-weighted logistic regression classifier on these features and measures performance on the Devign dataset using precision-recall AUC and recall at a fixed low false-positive rate. The combined model reaches PR-AUC 0.642 and Recall@10% 0.161 when functions are randomly split, but performance falls to roughly PR-AUC 0.436 in cross-project settings. These numbers establish that the simple features can serve as a fast, transparent baseline for initial human review while also revealing limits in generalization across codebases.

Core claim

The central claim is that TF-IDF token n-grams taken from raw function text, when combined with the metrics NLOC, approximate cyclomatic complexity, token count, maximum brace depth and parameter count, allow a class-weighted logistic regression model to achieve PR-AUC 0.642 on random splits of the Devign dataset and thereby provide a reproducible, non-deep-learning baseline for function-level vulnerability ranking.
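The five metrics named above can all be approximated from raw function text without a parser. The following is a minimal sketch under that assumption, not the paper's implementation: the function name `function_metrics`, the regular expressions, and the choice of decision keywords are all hypothetical, and comments or string literals are not stripped.

```python
import re

# Hypothetical patterns; the paper's exact definitions are not given here.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
DECISION_RE = re.compile(r"\b(?:if|for|while|case)\b|&&|\|\||\?")


def function_metrics(src: str) -> dict:
    """Compute five cheap metrics from the raw text of one C/C++ function."""
    # NLOC: count of non-blank lines (comments are not stripped in this sketch).
    nloc = sum(1 for line in src.splitlines() if line.strip())

    # Approximate cyclomatic complexity: 1 + number of decision points.
    complexity = 1 + len(DECISION_RE.findall(src))

    # Token count: word runs plus individual punctuation characters.
    token_count = len(TOKEN_RE.findall(src))

    # Maximum brace nesting depth.
    depth = max_depth = 0
    for ch in src:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1

    # Parameter count: comma-separated items in the first parenthesised group.
    # Function-pointer parameters containing commas would be over-counted.
    params = 0
    start = src.find("(")
    if start != -1:
        level = 0
        for i in range(start, len(src)):
            if src[i] == "(":
                level += 1
            elif src[i] == ")":
                level -= 1
                if level == 0:
                    inner = src[start + 1:i].strip()
                    if inner and inner != "void":
                        params = inner.count(",") + 1
                    break

    return {
        "nloc": nloc,
        "cyclomatic": complexity,
        "token_count": token_count,
        "max_brace_depth": max_depth,
        "param_count": params,
    }


example = """
int copy(char *dst, const char *src, int n) {
    if (n > 0 && dst) {
        for (int i = 0; i < n; i++) {
            dst[i] = src[i];
        }
    }
    return n;
}
"""
metrics = function_metrics(example)
```

For the example function this yields 8 non-blank lines, a complexity of 4 (one `if`, one `&&`, one `for`), a maximum brace depth of 3, and 3 parameters.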

What carries the argument

The sparse TF-IDF representation of token n-grams extracted from raw function text, joined with the five listed code metrics and fed to class-weighted logistic regression.

Load-bearing premise

The features drawn from raw text and metrics actually reflect vulnerability properties that are not just project-specific lexical habits.

What would settle it

If systematically renaming every identifier in the test functions causes PR-AUC to fall near the random baseline, the claim that the features capture general vulnerability signals would be falsified.
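A test-only renaming check of this kind could look like the sketch below. It is a regex approximation, not the paper's procedure: `rename_identifiers` and the `keep` parameter are hypothetical names, the keyword set covers C only, and identifiers inside comments and string literals are renamed too. Whether library calls such as `strcpy` should be preserved is a design choice the source does not settle, so the sketch leaves it to the caller.

```python
import re

C_KEYWORDS = frozenset("""
auto break case char const continue default do double else enum extern
float for goto if inline int long register restrict return short signed
sizeof static struct switch typedef union unsigned void volatile while
""".split())

# \b prevents matching the letters inside numeric literals such as 0x1F.
IDENT_RE = re.compile(r"\b[A-Za-z_]\w*")


def rename_identifiers(src: str, keep=frozenset()) -> str:
    """Replace each distinct non-keyword identifier with id0, id1, ..."""
    mapping = {}

    def repl(match):
        name = match.group(0)
        if name in C_KEYWORDS or name in keep:
            return name
        if name not in mapping:
            mapping[name] = f"id{len(mapping)}"
        return mapping[name]

    return IDENT_RE.sub(repl, src)


renamed = rename_identifiers("int add(int a, int b) { return a + b; }")
kept = rename_identifiers("strcpy(dst, src);", keep=frozenset({"strcpy"}))
```

Here `renamed` is `int id0(int id1, int id2) { return id1 + id2; }` and `kept` is `strcpy(id0, id1);`. Applying the transform to test functions only, then rescoring with a model trained on the original text, measures how much of the ranking signal depends on identifier names.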

Original abstract

Vulnerability detection for C/C++ code increasingly relies on heavy representations such as code graphs and deep models, while many practical workflows still benefit from fast and reproducible ranking baselines for human triage. This preprint studies a lightweight function-level vulnerability triage pipeline that combines sparse token n-grams from raw function text with a small set of inexpensive code metrics, including NLOC, approximate cyclomatic complexity, token count, maximum brace depth, and parameter count. We use TF-IDF token features and a class-weighted logistic regression classifier, avoiding deep learning, transformers, and program graphs. Using the Devign function-level labels, we evaluate random and cross-project settings, including a FFmpeg-to-QEMU transfer experiment. We emphasize precision-recall AUC and Recall@10% as ranking-oriented metrics for skewed or triage-oriented workloads. On the random split, the best combined variant reaches PR-AUC 0.642 and Recall@10% 0.161, while cross-project generalization is substantially harder, with PR-AUC around 0.436. We further report ablations, test-only identifier-renaming robustness, and end-to-end efficiency. The results suggest that simple token and metric features provide a useful transparent baseline, but also expose sensitivity to superficial lexical cues and limited cross-project transfer.
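The two ranking metrics can be computed directly from labels and scores. The sketch below makes two assumptions that the abstract does not confirm: PR-AUC is taken as average precision (the step-wise area under the precision-recall curve), and Recall@10% is read as the share of all vulnerable functions that land in the top 10% of the ranked list. Under a different reading, such as recall at a 10% false-positive rate, the second function would change. Tied scores are broken by input order here.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision values at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)
    hits = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank
    return ap / total_pos


def recall_at_fraction(labels, scores, fraction=0.10):
    """Recall among the top `fraction` of items ranked by score (assumed reading)."""
    n = len(scores)
    k = max(1, int(round(fraction * n)))
    order = sorted(range(n), key=lambda i: -scores[i])
    found = sum(labels[i] for i in order[:k])
    return found / sum(labels)


# Ten functions, three of them vulnerable, already sorted by score.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

ap = average_precision(labels, scores)          # (1/1 + 2/3 + 3/6) / 3 = 13/18
r10 = recall_at_fraction(labels, scores, 0.10)  # 1 of 3 positives in the top 1
```

Under the top-fraction reading, Recall@10% is capped by the review budget: a reviewer who inspects 10% of the functions cannot recover more vulnerable functions than that budget contains, which is worth keeping in mind when interpreting the reported 0.161.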

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a lightweight function-level vulnerability triage pipeline for C/C++ code that extracts sparse TF-IDF token n-grams from raw function text combined with inexpensive metrics (NLOC, approximate cyclomatic complexity, token count, max brace depth, parameter count) and trains a class-weighted logistic regression classifier. It evaluates this approach on the Devign dataset using random splits and cross-project transfer (e.g., FFmpeg to QEMU), emphasizing PR-AUC and Recall@10% metrics, and reports ablations, identifier-renaming robustness tests, and efficiency measurements. The best combined model achieves PR-AUC 0.642 and Recall@10% 0.161 on random splits but drops to ~0.436 PR-AUC in cross-project settings, positioning the method as a transparent, reproducible baseline that avoids deep learning or graph representations.

Significance. If the results hold, the work supplies a fast, interpretable, and easily reproducible baseline for vulnerability ranking that can be directly compared against heavier models in practical triage workflows. Strengths include the use of public dataset labels, explicit ablations, identifier-renaming robustness checks, and efficiency reporting, which together allow clear assessment of when simple lexical-plus-metric features suffice versus when they fail to transfer.

major comments (2)
  1. [Abstract / Evaluation] Abstract and evaluation results: the reported drop from PR-AUC 0.642 (random split) to ~0.436 (FFmpeg-to-QEMU cross-project) is load-bearing for the central claim that the features provide a 'useful transparent baseline' for real triage. Real-world deployment routinely encounters unseen projects, so this magnitude of degradation indicates the token n-grams and metrics largely encode project-specific lexical patterns rather than transferable vulnerability signals; the manuscript must either qualify the utility claim substantially or supply additional cross-project experiments (e.g., more source-target pairs, project-stratified analysis) to support it.
  2. [Evaluation] Evaluation section: the cross-project protocol description lacks detail on whether training and test functions are strictly disjoint at the project level, how class imbalance is handled across projects, and whether any project-level covariates (e.g., coding style, library usage) were controlled; without these, it is difficult to rule out confounds that could exaggerate or mask the observed generalization gap.
minor comments (2)
  1. The manuscript should provide exact hyper-parameter values for TF-IDF (vocabulary size, n-gram range, min/max document frequency) and the logistic regression (regularization strength, solver) so that the baseline is fully reproducible from the text alone.
  2. Figure and table captions would benefit from explicit statements of which variant (token-only, metric-only, combined) is shown and whether results are averaged over multiple random seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and appropriately qualify our claims.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and evaluation results: the reported drop from PR-AUC 0.642 (random split) to ~0.436 (FFmpeg-to-QEMU cross-project) is load-bearing for the central claim that the features provide a 'useful transparent baseline' for real triage. Real-world deployment routinely encounters unseen projects, so this magnitude of degradation indicates the token n-grams and metrics largely encode project-specific lexical patterns rather than transferable vulnerability signals; the manuscript must either qualify the utility claim substantially or supply additional cross-project experiments (e.g., more source-target pairs, project-stratified analysis) to support it.

    Authors: We agree that the performance drop from 0.642 to ~0.436 PR-AUC is significant and indicates that the features capture a substantial amount of project-specific lexical information rather than purely transferable vulnerability signals. This aligns with our existing discussion of sensitivity to superficial cues. In the revised manuscript we have substantially qualified the central claim in the abstract, introduction, and conclusion: the approach is now presented as a useful transparent baseline primarily for within-project or similar-project triage settings where project-specific training data can be obtained, while explicitly noting the limited cross-project transfer as a key limitation. We have not added further cross-project pairs in this revision cycle, as the current experiments already demonstrate the generalization gap; the qualification addresses the referee's concern without overstating applicability to fully unseen projects. revision: partial

  2. Referee: [Evaluation] Evaluation section: the cross-project protocol description lacks detail on whether training and test functions are strictly disjoint at the project level, how class imbalance is handled across projects, and whether any project-level covariates (e.g., coding style, library usage) were controlled; without these, it is difficult to rule out confounds that could exaggerate or mask the observed generalization gap.

    Authors: We thank the referee for highlighting the need for greater protocol transparency. In the revised manuscript we have added a new subsection in the Evaluation section that explicitly states: (1) training and test functions are strictly disjoint at the project level with no shared functions between source and target projects; (2) class imbalance is addressed consistently by computing class weights from the training set and applying them in the logistic regression for every experiment, including cross-project transfers; and (3) while project-level covariates such as coding style and library usage are not explicitly controlled (as is standard in cross-project benchmarks), we discuss them as inherent aspects of the generalization task and note that the observed gap reflects realistic deployment conditions rather than artifacts of the protocol. These additions allow readers to better assess potential confounds. revision: yes
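Points (1) and (2) of this response can be illustrated with a short sketch. The record fields `project` and `target`, the helper names, and the toy records are assumptions for illustration. The weighting formula shown is the common "balanced" heuristic, n_samples / (n_classes * class_count); the source says only that weights are computed from the training set, not which formula is used.

```python
from collections import Counter


def cross_project_split(records, source, target):
    """Train on one project and test on another, with no overlap."""
    if source == target:
        raise ValueError("source and target projects must differ")
    train = [r for r in records if r["project"] == source]
    test = [r for r in records if r["project"] == target]
    return train, test


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency in the training set."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * count) for cls, count in counts.items()}


# Toy records, invented for illustration only.
records = [
    {"project": "FFmpeg", "target": 1},
    {"project": "FFmpeg", "target": 0},
    {"project": "FFmpeg", "target": 0},
    {"project": "qemu", "target": 1},
    {"project": "qemu", "target": 0},
]

train, test = cross_project_split(records, "FFmpeg", "qemu")
# Weights come from the training labels only, never from the target project.
weights = balanced_class_weights([r["target"] for r in train])
```

With two negatives and one positive in the training project, `weights` is `{1: 1.5, 0: 0.75}`. Deriving the weights from the source project alone keeps the target project's class ratio out of training, which matters when the two projects differ in how many functions are labeled vulnerable.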

Circularity Check

0 steps flagged

No circularity; empirical ML pipeline with external labels and held-out evaluation

The paper describes a standard supervised ML pipeline: TF-IDF token n-grams plus hand-crafted code metrics fed to class-weighted logistic regression, trained and evaluated on the external Devign dataset using random splits and cross-project transfers. No first-principles derivation, uniqueness theorem, or ansatz is invoked; performance numbers (PR-AUC, Recall@10%) are obtained by fitting on training folds and scoring on disjoint test data. Ablations and identifier-renaming tests are likewise direct empirical measurements. Because every reported quantity is computed from held-out labels rather than being algebraically entailed by the input features or prior self-citations, the evaluation chain contains no self-definitional, fitted-input-renamed-as-prediction, or self-citation-load-bearing steps.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The method depends on standard supervised learning assumptions and the quality of the external dataset; no new entities are introduced.

free parameters (3)
  • TF-IDF parameters
    Parameters for term frequency-inverse document frequency vectorization fitted from training data.
  • Logistic regression coefficients
    Model weights learned from the training set to predict vulnerability labels.
  • Class weights
    Weights to handle class imbalance in the dataset.
axioms (2)
  • domain assumption Devign dataset provides reliable function-level vulnerability labels
    The evaluation relies on these labels as ground truth for both training and testing.
  • domain assumption Random and cross-project splits appropriately test generalization
    Used to assess in-distribution and out-of-distribution performance.

pith-pipeline@v0.9.0 · 5514 in / 1404 out tokens · 68925 ms · 2026-05-08T17:21:02.759773+00:00 · methodology

