pith. machine review for the scientific record.

arxiv: 2604.21051 · v1 · submitted 2026-04-22 · 💻 cs.SE · cs.CR

Recognition: unknown

Residual Risk Analysis in Benign Code: How Far Are We? A Multi-Model Semantic and Structural Similarity Approach

Authors on Pith no claims yet

Pith reviewed 2026-05-09 23:27 UTC · model grok-4.3

classification 💻 cs.SE cs.CR
keywords residual · similarity · code · risk analysis · benign · security · semantic

The pith

Patched functions often remain similar to vulnerable ones, and a new multi-model similarity scoring system identifies residual issues like null pointer dereferences in 61% of high-risk cases from the PrimeVul dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses a gap in software security: patched code is usually assumed to be safe, yet it may still carry residual risks. The researchers examine pairs of vulnerable and patched functions from the PrimeVul benchmark dataset. They use multiple code language models to measure semantic similarity through embeddings, which capture the meaning of the code, and Tree-sitter to build abstract syntax trees (ASTs) and compare the code's structure.

These two signals, together with how strongly the different models agree, are combined into a Residual Risk Scoring (RRS) system that estimates how likely the patched code is to still carry security problems. The analysis shows that many patched functions remain very similar to the original vulnerable code in both meaning and structure, suggesting the patch may not have fully addressed the underlying issue.

To validate the score, the authors ran popular static analysis tools (Cppcheck, Clang-Tidy, and Facebook-Infer) on the pairs with high RRS scores. About 61 percent of these high-risk pairs showed one of 13 types of residual problems, such as null pointer dereferences or unsafe memory allocation. The study concludes that code similarity is a useful signal for prioritizing which patched code needs extra inspection, making security assessments more reliable in large software projects. The approach combines modern AI techniques with traditional code analysis to tackle an underexplored area of vulnerability management.
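Based on the description above, the RRS combination can be sketched as a weighted sum of the three signals. Everything below is illustrative: the weight values, the agreement measure (one minus the spread of per-model scores), and the toy embeddings are assumptions for demonstration, not the paper's actual formula or data.

```python
import numpy as np

def cosine(u, v):
    # semantic similarity between two embedding vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rrs(sem_sims, ast_sim, alpha=0.6, beta=0.2):
    """Hypothetical Residual Risk Scoring combination.

    sem_sims : per-model cosine similarities between embeddings of the
               vulnerable and patched function (one entry per code LM).
    ast_sim  : localized AST structural similarity in [0, 1].
    alpha, beta : illustrative weights (the paper's Figure 5 varies
                  alpha and beta; the published values are not used here).
    The remaining weight (1 - alpha - beta) goes to cross-model
    agreement, modeled as 1 minus the spread of the per-model scores.
    """
    sem = float(np.mean(sem_sims))
    agreement = 1.0 - float(np.std(sem_sims))
    return alpha * sem + beta * ast_sim + (1 - alpha - beta) * agreement

# Toy example: three models embed a vulnerable/patched pair; the patch
# barely changes the code, so similarity stays high.
rng = np.random.default_rng(0)
vuln = [rng.normal(size=8) for _ in range(3)]
patched = [v + rng.normal(scale=0.1, size=8) for v in vuln]
sem_sims = [cosine(u, w) for u, w in zip(vuln, patched)]
score = rrs(sem_sims, ast_sim=0.9)
print(round(score, 3))  # high score -> flag this pair for post-patch inspection
```

A near-identical patch scores close to 1, which is exactly the regime the paper argues deserves extra static-analysis attention.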

Core claim

Our analysis shows that benign functions often remain highly similar to their vulnerable counterparts both semantically and structurally, indicating potential persistence of residual risk. We further find that approximately 61% of high-RRS code pairs exhibit 13 distinct categories of residual issues (e.g., null pointer dereferences, unsafe memory allocation), validated using state-of-the-art static analysis tools including Cppcheck, Clang-Tidy, and Facebook-Infer.

Load-bearing premise

That high semantic and structural similarity between vulnerable and patched functions reliably signals the presence of residual security risks detectable and classifiable by static analysis tools like Cppcheck.

Figures

Figures reproduced from arXiv: 2604.21051 by Mohammad Farhad, Shuvalaxmi Dass.

Figure 1: Vulnerable and benign code (left) with corresponding AST subtrees
Figure 2: Residual risk analysis pipeline combining semantic similarity (…
Figure 3: Semantic and structural similarity analysis across vulnerable–benign func…
Figure 4: Semantic vs. structural similarity for vulnerable–benign pairs under dif…
Figure 5: Visualization of embedding similarity, localized AST similarity, and cross-model agreement across a subset of vulnerable–benign function pairs. (a) α = 0.6, β = 0.2 (b) α = 0.2, β = 0.6
Figure 7: Dummy helpers used for static analysis
read the original abstract

Software security relies on effective vulnerability detection and patching, yet determining whether a patch fully eliminates risk remains an underexplored challenge. Existing vulnerability benchmarks often treat patched functions as inherently benign, overlooking the possibility of residual security risks. In this work, we analyze vulnerable-benign function pairs from the PrimeVul, a benchmark dataset using multiple code language models (Code LMs) to capture semantic similarity, complemented by Tree-sitter-based abstract syntax tree (AST) analysis for structural similarity. Building on these signals, we propose Residual Risk Scoring (RRS), a unified framework that integrates embedding-based semantic similarity, localized AST-based structural similarity, and cross-model agreement to estimate residual risk in code. Our analysis shows that benign functions often remain highly similar to their vulnerable counterparts both semantically and structurally, indicating potential persistence of residual risk. We further find that approximately $61\%$ of high-RRS code pairs exhibit $13$ distinct categories of residual issues (e.g., null pointer dereferences, unsafe memory allocation), validated using state-of-the-art static analysis tools including Cppcheck, Clang-Tidy, and Facebook-Infer. These results demonstrate that code-level similarity provides a practical signal for prioritizing post-patch inspection, enabling more reliable and scalable security assessment in real-world open-source software pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes residual security risks in patched functions by measuring semantic similarity with multiple code language models and structural similarity with AST parsing on vulnerable-benign pairs from PrimeVul. It introduces Residual Risk Scoring (RRS) to combine these signals and reports that benign functions often remain similar to their vulnerable counterparts. It further finds that 61% of high-RRS pairs exhibit issues spanning 13 categories, such as null pointer dereferences, detected by Cppcheck, Clang-Tidy, and Infer, and proposes similarity as a signal for prioritizing inspections.

Significance. If the central result is confirmed with appropriate controls, this work could have significant practical impact by providing an automated, scalable method to detect potential incomplete patches in open-source code using readily available tools. The multi-model approach and validation with static analyzers add credibility to the similarity-based risk estimation.

major comments (2)
  1. [Abstract] The claim that approximately 61% of high-RRS code pairs exhibit 13 distinct categories of residual issues lacks support from a baseline analysis. The static analysis tools were run only on high-RRS pairs, without reporting the rate of findings in low-RRS patched functions or a control sample. This omission means the 61% cannot be distinguished from the background rate at which these tools flag issues in typical C/C++ code, weakening the link between high RRS and residual risk.
  2. [Method] The definition of the Residual Risk Scoring (RRS) formula, including the weighting for cross-model agreement and the threshold for classifying high-RRS, is not fully specified with concrete values or sensitivity analysis. Since these are free parameters, the robustness of the reported 61% figure to different choices is unclear and requires explicit documentation for reproducibility.
minor comments (2)
  1. [Abstract] The abstract mentions 'benign code' but the analysis is on patched functions; consistent terminology would improve clarity.
  2. Consider adding a table summarizing the 13 categories of residual issues with examples for better reader understanding.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important aspects for strengthening the manuscript. We address each major comment below and are prepared to make the necessary revisions to improve clarity, reproducibility, and evidential support.

read point-by-point responses
  1. Referee: [Abstract] The claim that approximately 61% of high-RRS code pairs exhibit 13 distinct categories of residual issues lacks support from a baseline analysis. The static analysis tools were run only on high-RRS pairs, without reporting the rate of findings in low-RRS patched functions or a control sample. This omission means the 61% cannot be distinguished from the background rate at which these tools flag issues in typical C/C++ code, weakening the link between high RRS and residual risk.

    Authors: We agree that a baseline analysis is required to isolate the contribution of high RRS from the general rate at which static analyzers flag issues in C/C++ code. In the revised manuscript we will add results from applying Cppcheck, Clang-Tidy, and Infer to (i) the low-RRS patched functions in PrimeVul and (ii) a control sample of unrelated, non-patched C/C++ functions drawn from the same repositories. This will allow direct comparison of finding rates and will clarify whether the observed 61% rate in high-RRS pairs exceeds the background rate. revision: yes

  2. Referee: [Method] The definition of the Residual Risk Scoring (RRS) formula, including the weighting for cross-model agreement and the threshold for classifying high-RRS, is not fully specified with concrete values or sensitivity analysis. Since these are free parameters, the robustness of the reported 61% figure to different choices is unclear and requires explicit documentation for reproducibility.

    Authors: We will revise the Methods section to present the complete RRS formula, including the exact numerical weights assigned to semantic similarity, structural similarity, and cross-model agreement, as well as the numerical threshold used to designate high-RRS pairs. We will also add a sensitivity analysis that varies these weights and the threshold over plausible ranges and reports the resulting variation in the 61% statistic, thereby demonstrating robustness and supporting reproducibility. revision: yes
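The sensitivity analysis promised above could take the following shape: sweep the weights α and β (with γ = 1 − α − β assigned to cross-model agreement) and the high-RRS threshold, then recompute the issue rate among flagged pairs for each setting. The pair data and the weight/threshold grids here are invented for illustration; they are not the paper's numbers.

```python
def rrs(sem, ast, agree, alpha, beta):
    # hypothetical weighted combination; gamma = 1 - alpha - beta
    return alpha * sem + beta * ast + (1 - alpha - beta) * agree

# Toy pairs: (semantic sim, AST sim, cross-model agreement, has_residual_issue)
pairs = [
    (0.97, 0.95, 0.90, True),
    (0.95, 0.90, 0.85, True),
    (0.60, 0.40, 0.70, False),
    (0.92, 0.88, 0.80, False),
    (0.30, 0.20, 0.60, False),
]

# Sweep weights and threshold; report how the issue rate in the
# "high-RRS" subset moves as the free parameters change.
for alpha, beta in [(0.6, 0.2), (0.4, 0.4), (0.2, 0.6)]:
    for thresh in (0.7, 0.8, 0.9):
        high = [p for p in pairs if rrs(p[0], p[1], p[2], alpha, beta) >= thresh]
        rate = sum(p[3] for p in high) / len(high) if high else float("nan")
        print(f"alpha={alpha} beta={beta} thresh={thresh}: "
              f"{len(high)} high-RRS pairs, issue rate {rate:.2f}")
```

If the headline statistic is robust, the issue rate should stay roughly stable across plausible weightings rather than spike only at the published setting.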

Circularity Check

0 steps flagged

No significant circularity; RRS defined from independent metrics and validated externally

full rationale

The paper constructs RRS from semantic embeddings of pre-trained code LMs and Tree-sitter AST structural similarity, then applies independent static-analysis tools (Cppcheck, Clang-Tidy, Infer) to count issues in the high-RRS subset. No parameters are fitted to the 61% outcome, no self-citation chain supports the central claim, and the derivation does not reduce any result to its own inputs by construction. The central claim is checked against external benchmarks and tools rather than against the score's own inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The claim rests on the domain assumption that code similarity indicates residual risk and introduces RRS as a new scoring construct with likely tunable parameters for combining signals; no machine-checked proofs or external benchmarks are mentioned.

free parameters (2)
  • RRS high-risk threshold
    Used to select the subset of pairs for the 61% residual issue validation; chosen or fitted to produce the reported statistic.
  • Cross-model agreement weighting
    Parameter controlling how model agreement contributes to the overall RRS score.
axioms (2)
  • domain assumption Embedding similarity from code LMs captures security-relevant semantic properties of vulnerabilities
    Invoked when using multiple Code LMs to measure semantic similarity as a risk signal.
  • domain assumption Localized AST structural similarity correlates with residual code flaws
    Basis for the Tree-sitter component of RRS.
invented entities (1)
  • Residual Risk Scoring (RRS) no independent evidence
    purpose: Unified score integrating semantic embeddings, AST structure, and model agreement to estimate residual risk
    Newly defined framework whose validity is not independently verified outside this analysis.

pith-pipeline@v0.9.0 · 5536 in / 1717 out tokens · 61771 ms · 2026-05-09T23:27:34.830801+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 17 canonical work pages · 2 internal anchors

  1. Meta. https://www.meta.com/ (2004), accessed: 02-12-2026
  2. Cppcheck: A tool for static C/C++ code analysis. http://cppcheck.sourceforge.net/ (2007), accessed: 02-13-2026
  3. Clang-Tidy: LLVM/Clang-based static analyzer. https://clang.llvm.org/extra/clang-tidy/ (2013), accessed: 02-10-2026
  4. Infer: A static analyzer for Java, C, C++, and Objective-C. https://fbinfer.com/ (2015), accessed: 02-12-2026
  5. Bilge, L., Dumitraş, T.: Before we knew it: an empirical study of zero-day attacks in the real world. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security. pp. 833–844. CCS '12, Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2382196.2382284
  6. Böhme, M.: STADS: Software testing as species discovery. ACM Trans. Softw. Eng. Methodol. 27(2) (Jun 2018). https://doi.org/10.1145/3210309
  7. Cheng, X., Zhang, G., Wang, H., Sui, Y.: Path-sensitive code embedding via contrastive learning for software vulnerability detection. In: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. pp. 519–531 (2022)
  8. Croft, R., Newlands, D., Chen, Z., Babar, M.A.: An empirical study of rule-based and learning-based approaches for static application security testing. In: Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). pp. 1–12 (2021)
  9. Dijkstra, E.W.: Notes on structured programming (1970). https://api.semanticscholar.org/CorpusID:8242220
  10. Ding, Y., Fu, Y., Ibrahim, O., Sitawarin, C., Chen, X., Alomair, B., Wagner, D., Ray, B., Chen, Y.: Vulnerability detection with code language models: How far are we? In: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). pp. 1729–1741 (2025). https://doi.org/10.1109/ICSE55347.2025.00038
  11. Ebrahim, F., Joy, M.: Source code plagiarism detection with pre-trained model embeddings and automated machine learning. In: Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing. pp. 301–309 (2023)
  12. Falleri, J.R., Morandat, F., Blanc, X., Martinez, M., Monperrus, M.: Fine-grained and accurate source code differencing. In: Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering. pp. 313–324 (2014)
  13. Farhad, M., Rahman, S., Dass, S.: HYDRA: A hybrid heuristic-guided deep representation architecture for predicting latent zero-day vulnerabilities in patched functions. arXiv preprint arXiv:2511.06220 (2025)
  14. Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., Zhou, M.: CodeBERT: A pre-trained model for programming and natural languages (2020)
  15. Fu, M., Tantithamthavorn, C., Le, T., Nguyen, V., Phung, D.: VulRepair: a T5-based automated software vulnerability repair. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. pp. 935–947. ESEC/FSE 2022, Association for Computing Machinery, New York, NY, USA (2022)
  16. Guo, D., Lu, S., Duan, N., Wang, Y., Zhou, M., Yin, J.: UniXcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850 (2022)
  17. Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S., Zhou, L., Duan, N., Svyatkovskiy, A., Fu, S., et al.: GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020)
  18. Han, M., Wang, L., Chang, J., Li, B., Zhang, C.: Learning graph-based patch representations for identifying and assessing silent vulnerability fixes. In: 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). pp. 120–131 (2024). https://doi.org/10.1109/ISSRE62328.2024.00022
  19. Hugging Face, Inc.: Hugging Face model hub. https://huggingface.co/, accessed: 11-20-2025
  20. Katz, K., Moshtari, S., Mujhid, I., Mirakhorli, M., Garcia, D.: Siexvults: Sensitive information exposure vulnerability detection system using transformer models and static analysis. In: 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). pp. 230–241. IEEE (2025)
  21. Kholoosi, M.M., Le, T.H.M., Babar, M.A.: Software vulnerability management in the era of artificial intelligence: An industry perspective. arXiv preprint arXiv:2512.18261 (2025)
  22. Le, T.H.M., Babar, M.A.: Automatic data labeling for software vulnerability prediction models: How far are we? In: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. pp. 131–142 (2024)
  23. Lee, S., Böhme, M.: Dependency-aware residual risk analysis. In: Proceedings of the 48th IEEE/ACM International Conference on Software Engineering, ICSE. vol. 26 (2026)
  24. Martinez-Gil, J.: Evaluating small-scale code models for code clone detection. arXiv preprint arXiv:2506.10995 (2025)
  25. Nguyen, A.T., Le, T.H.M., Babar, M.A.: Automated code-centric software vulnerability assessment: How far are we? An empirical study in C/C++. In: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. pp. 72–83 (2024)
  26. Nikiema, S.L., Djire, A.E., Bonkoungou, A.A., Moumoula, M.B., Samhi, J., Kabore, A.K., Klein, J., Bissyande, T.F.: How small transformation expose the weakness of semantic similarity measures. arXiv preprint arXiv:2509.09714 (2025)
  27. Selvaraj, M., Uddin, G.: Does collaborative editing help mitigate security vulnerabilities in crowd-shared IoT code examples? In: Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. pp. 92–102 (2022)
  28. Shi, E., Wang, Y., Du, L., Zhang, H., Han, S., Zhang, D., Sun, H.: CoCoAST: representing source code via hierarchical splitting and reconstruction of abstract syntax trees. Empirical Software Engineering 28(6), 135 (2023)
  29. Song, Y., Lothritz, C., Tang, X., Bissyandé, T., Klein, J.: Revisiting code similarity evaluation with abstract syntax tree edit distance. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 38–46 (2024)
  30. Sun, W., Fang, C., Miao, Y., You, Y., Yuan, M., Chen, Y., Zhang, Q., Guo, A., Chen, X., Liu, Y., et al.: Abstract syntax tree for programming language understanding and representation: How far are we? arXiv preprint arXiv:2312.00413 (2023)
  31. Tang, X., Ezzini, S., Tian, H., Song, Y., Klein, J., Bissyande, T.F., et al.: Multilevel semantic embedding of software patches: a fine-to-coarse grained approach towards security patch detection. arXiv preprint arXiv:2308.15233 (2023)
  32. Tree-sitter Contributors: Tree-sitter: A parser generator tool and incremental parsing library. https://tree-sitter.github.io/tree-sitter/ (2024), accessed: 01-21-2026
  33. Wang, K., Yan, M., Zhang, H., Hu, H.: Unified abstract syntax tree representation learning for cross-language program classification. In: Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension. pp. 390–400 (2022)
  34. Wang, S., Wen, M., Chen, L., Yi, X., Mao, X.: How different is it between machine-generated and developer-provided patches? An empirical study on the correct patches generated by automated program repair techniques. In: 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). pp. 1–12. IEEE (2019)
  35. Wang, S., Wang, X., Sun, K., Jajodia, S., Wang, H., Li, Q.: GraphSPD: Graph-based security patch detection with enriched code semantics. In: 2023 IEEE Symposium on Security and Privacy (SP). pp. 2409–2426. IEEE Computer Society, Los Alamitos, CA, USA (May 2023). https://doi.org/10.1109/SP46215.2023.10179479
  36. Wang, Y., Le, H., Gotmare, A.D., Bui, N.D., Li, J., Hoi, S.C.H.: CodeT5+: Open code large language models for code understanding and generation. arXiv preprint (2023)
  37. Wang, Y., Wang, W., Joty, S., Hoi, S.C.H.: CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation (2021)
  38. Weissberg, F., Pirch, L., Imgrund, E., Möller, J., Eisenhofer, T., Rieck, K.: LLM-based vulnerability discovery through the lens of code metrics. arXiv preprint arXiv:2509.19117 (2025)
  39. Xie, Z., Wen, M., Wei, Z., Jin, H.: Unveiling the characteristics and impact of security patch evolution. In: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. pp. 1094–1106 (2024)
  40. Ye, Z., Sun, X., Cao, S., Bo, L., Li, B.: Well begun is half done: Location-aware and trace-guided iterative automated vulnerability repair. arXiv preprint arXiv:2512.20203 (2025)
  41. Yi, G., Nong, Y., Li, M., Cai, H.: Exploring and improving real-world vulnerability data generation via prompting large language models (2026)
  42. Zio, E.: Challenges in the vulnerability and risk analysis of critical infrastructures. Reliability Engineering & System Safety 152, 137–150 (2016). https://doi.org/10.1016/j.ress.2016.02.009