Residual Risk Analysis in Benign Code: How Far Are We? A Multi-Model Semantic and Structural Similarity Approach
Pith reviewed 2026-05-09 23:27 UTC · model grok-4.3
The pith
Patched functions often remain semantically and structurally similar to their vulnerable counterparts, and a multi-model similarity score (RRS) surfaces residual issues such as null pointer dereferences in roughly 61% of high-risk pairs from the PrimeVul dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our analysis shows that benign functions often remain highly similar to their vulnerable counterparts both semantically and structurally, indicating potential persistence of residual risk. We further find that approximately 61% of high-RRS code pairs exhibit 13 distinct categories of residual issues (e.g., null pointer dereferences, unsafe memory allocation), validated using state-of-the-art static analysis tools including Cppcheck, Clang-Tidy, and Facebook-Infer.
Load-bearing premise
That high semantic and structural similarity between vulnerable and patched functions reliably signals the presence of residual security risks detectable and classifiable by static analysis tools like Cppcheck.
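A hedged sketch of how such a combined score could look. The weights, the agreement term, and the function shape below are illustrative placeholders, not the paper's formula; as the referee notes later, the exact RRS definition is not fully specified.

```python
from statistics import mean, pstdev

def residual_risk_score(sem_sims, struct_sim,
                        w_sem=0.4, w_struct=0.4, w_agree=0.2):
    """Combine per-model semantic similarities, AST structural similarity,
    and cross-model agreement into one score in [0, 1].

    sem_sims:   cosine similarities between the vulnerable and patched
                function embeddings, one value per code LM.
    struct_sim: localized AST similarity (e.g. from Tree-sitter diffing).
    Weights and the agreement definition are hypothetical placeholders.
    """
    avg_sem = mean(sem_sims)
    # Agreement is 1.0 when every model reports the same similarity
    # and shrinks as the models diverge.
    agreement = 1.0 - pstdev(sem_sims)
    return w_sem * avg_sem + w_struct * struct_sim + w_agree * agreement

# A pair whose patch barely changed the function scores high:
score = residual_risk_score([0.97, 0.95, 0.96], struct_sim=0.93)
print(f"{score:.3f}")
```

The cross-model agreement term is what distinguishes this from a single-model similarity check: a pair only scores high when several independent embeddings concur.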
read the original abstract
Software security relies on effective vulnerability detection and patching, yet determining whether a patch fully eliminates risk remains an underexplored challenge. Existing vulnerability benchmarks often treat patched functions as inherently benign, overlooking the possibility of residual security risks. In this work, we analyze vulnerable-benign function pairs from PrimeVul, a benchmark dataset, using multiple code language models (Code LMs) to capture semantic similarity, complemented by Tree-sitter-based abstract syntax tree (AST) analysis for structural similarity. Building on these signals, we propose Residual Risk Scoring (RRS), a unified framework that integrates embedding-based semantic similarity, localized AST-based structural similarity, and cross-model agreement to estimate residual risk in code. Our analysis shows that benign functions often remain highly similar to their vulnerable counterparts both semantically and structurally, indicating potential persistence of residual risk. We further find that approximately 61% of high-RRS code pairs exhibit 13 distinct categories of residual issues (e.g., null pointer dereferences, unsafe memory allocation), validated using state-of-the-art static analysis tools including Cppcheck, Clang-Tidy, and Facebook-Infer. These results demonstrate that code-level similarity provides a practical signal for prioritizing post-patch inspection, enabling more reliable and scalable security assessment in real-world open-source software pipelines.
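The structural-similarity signal described in the abstract can be illustrated with a toy version. The paper parses C/C++ with Tree-sitter; the sketch below substitutes Python's built-in `ast` module and a node-type multiset overlap, a much cruder stand-in than a real AST diff, purely to show why a guard-adding patch stays structurally close to the vulnerable original.

```python
import ast
from collections import Counter

def node_counts(src: str) -> Counter:
    """Multiset of AST node types appearing in a snippet."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(src)))

def struct_similarity(a: str, b: str) -> float:
    """Weighted Jaccard overlap of the two node-type multisets."""
    ca, cb = node_counts(a), node_counts(b)
    shared = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return shared / union if union else 1.0

# A null-check patch leaves most of the original structure in place:
vuln = "def read(p):\n    return p.size"
patched = "def read(p):\n    if p is None:\n        return 0\n    return p.size"
print(round(struct_similarity(vuln, patched), 2))
```

Every node of the vulnerable function survives in the patch here; only the added guard lowers the overlap, which is the intuition behind treating high structural similarity as a residual-risk signal.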
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes residual security risks in patched functions by measuring semantic similarity with multiple code language models and structural similarity with AST parsing on vulnerable-benign pairs from PrimeVul. It introduces Residual Risk Scoring (RRS), which combines these signals, and reports that benign functions often remain similar to their vulnerable counterparts. It further finds that 61% of high-RRS pairs exhibit 13 categories of issues, such as null pointer dereferences, detected by Cppcheck, Clang-Tidy, and Infer, and proposes similarity as a signal for prioritizing post-patch inspection.
Significance. If the central result is confirmed with appropriate controls, this work could have significant practical impact by providing an automated, scalable method to detect potential incomplete patches in open-source code using readily available tools. The multi-model approach and validation with static analyzers add credibility to the similarity-based risk estimation.
major comments (2)
- [Abstract] The claim that approximately 61% of high-RRS code pairs exhibit 13 distinct categories of residual issues lacks support from a baseline analysis. The static analysis tools were run only on high-RRS pairs, without reporting the rate of findings in low-RRS patched functions or a control sample. This omission means the 61% cannot be distinguished from the background rate at which these tools flag issues in typical C/C++ code, weakening the link between high RRS and residual risk.
- [Method] The definition of the Residual Risk Scoring (RRS) formula, including the weighting for cross-model agreement and the threshold for classifying high-RRS, is not fully specified with concrete values or sensitivity analysis. Since these are free parameters, the robustness of the reported 61% figure to different choices is unclear and requires explicit documentation for reproducibility.
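The first objection amounts to a missing control computation. A minimal sketch of the comparison being asked for, where the 30% background rate is a purely hypothetical placeholder that the paper would need to measure on low-RRS or control functions:

```python
def enrichment(high_flagged, high_total, control_flagged, control_total):
    """Compare static-analysis finding rates between high-RRS pairs and a
    control sample; a ratio well above 1 would suggest RRS carries signal
    beyond the tools' background rate."""
    high_rate = high_flagged / high_total
    control_rate = control_flagged / control_total
    return high_rate, control_rate, high_rate / control_rate

# 61% flagged among high-RRS pairs (from the abstract) versus a
# hypothetical 30% background rate in a control sample.
high, ctrl, ratio = enrichment(61, 100, 30, 100)
print(f"high-RRS: {high:.0%}, control: {ctrl:.0%}, enrichment: {ratio:.2f}x")
```

Without the control denominator, the 61% on its own is uninterpretable, which is exactly the referee's point.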
minor comments (2)
- [Abstract] The abstract mentions 'benign code' but the analysis is on patched functions; consistent terminology would improve clarity.
- Consider adding a table summarizing the 13 categories of residual issues with examples for better reader understanding.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important aspects for strengthening the manuscript. We address each major comment below and are prepared to make the necessary revisions to improve clarity, reproducibility, and evidential support.
read point-by-point responses
- Referee: [Abstract] The claim that approximately 61% of high-RRS code pairs exhibit 13 distinct categories of residual issues lacks support from a baseline analysis. The static analysis tools were run only on high-RRS pairs, without reporting the rate of findings in low-RRS patched functions or a control sample. This omission means the 61% cannot be distinguished from the background rate at which these tools flag issues in typical C/C++ code, weakening the link between high RRS and residual risk.
  Authors: We agree that a baseline analysis is required to isolate the contribution of high RRS from the general rate at which static analyzers flag issues in C/C++ code. In the revised manuscript we will add results from applying Cppcheck, Clang-Tidy, and Infer to (i) the low-RRS patched functions in PrimeVul and (ii) a control sample of unrelated, non-patched C/C++ functions drawn from the same repositories. This will allow direct comparison of finding rates and will clarify whether the observed 61% rate in high-RRS pairs exceeds the background rate. revision: yes
- Referee: [Method] The definition of the Residual Risk Scoring (RRS) formula, including the weighting for cross-model agreement and the threshold for classifying high-RRS, is not fully specified with concrete values or sensitivity analysis. Since these are free parameters, the robustness of the reported 61% figure to different choices is unclear and requires explicit documentation for reproducibility.
  Authors: We will revise the Methods section to present the complete RRS formula, including the exact numerical weights assigned to semantic similarity, structural similarity, and cross-model agreement, as well as the numerical threshold used to designate high-RRS pairs. We will also add a sensitivity analysis that varies these weights and the threshold over plausible ranges and reports the resulting variation in the 61% statistic, thereby demonstrating robustness and supporting reproducibility. revision: yes
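The promised sensitivity analysis is straightforward to sketch. The data below are synthetic stand-ins (random scores and flags), so the point is the sweep mechanics, not the numbers:

```python
import random

random.seed(0)
# Synthetic stand-ins: (rrs_score, tool_flagged) for each code pair.
pairs = [(random.random(), random.random() < 0.5) for _ in range(2000)]

def headline_stat(threshold, pairs):
    """Fraction of pairs at or above the RRS threshold that a static
    analyzer flags, i.e. the quantity behind the paper's 61% figure."""
    high = [flagged for score, flagged in pairs if score >= threshold]
    return len(high), (sum(high) / len(high) if high else float("nan"))

# Sweep the free threshold and report how the headline statistic moves.
for t in (0.6, 0.7, 0.8, 0.9):
    n, rate = headline_stat(t, pairs)
    print(f"threshold={t:.1f}  n={n:4d}  flagged={rate:.1%}")
```

If the flagged rate is stable across a plausible range of thresholds, the 61% figure is robust; if it swings widely, the threshold choice is doing the work.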
Circularity Check
No significant circularity; RRS is defined from independent metrics and validated externally
full rationale
The paper constructs RRS from semantic embeddings of pre-trained code LMs and Tree-sitter AST structural similarity, then applies independent static-analysis tools (Cppcheck, Clang-Tidy, Infer) to count issues in the high-RRS subset. No parameters are fitted to the 61% outcome, no self-citation chain supports the central claim, and the derivation does not reduce any result to its own inputs by construction. The approach is checked against external benchmarks and tools rather than against its own outputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- RRS high-risk threshold
- Cross-model agreement weighting
axioms (2)
- domain assumption: Embedding similarity from code LMs captures security-relevant semantic properties of vulnerabilities
- domain assumption: Localized AST structural similarity correlates with residual code flaws
invented entities (1)
- Residual Risk Scoring (RRS): no independent evidence
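Both domain assumptions bottom out in similarity computations. A toy cosine-similarity sketch of the semantic half; the three-dimensional vectors are placeholders rather than real Code LM embeddings, which typically have hundreds of dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder embeddings for a vulnerable function and its patch; a real
# pipeline would pool a code LM's hidden states over each function's tokens.
vuln_emb = [0.9, 0.1, 0.4]
patch_emb = [0.8, 0.2, 0.4]
print(round(cosine(vuln_emb, patch_emb), 3))
```

The axioms assert that when this number stays near 1.0 after patching, security-relevant behavior is likely preserved too; that is precisely the claim the ledger flags as needing independent evidence.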
Reference graph
Works this paper leans on
- [1] Meta. https://www.meta.com/ (2004), accessed: 02-12-2026
- [2] Cppcheck: A tool for static C/C++ code analysis. http://cppcheck.sourceforge.net/ (2007), accessed: 02-13-2026
- [3] Clang-Tidy: LLVM/Clang-based static analyzer. https://clang.llvm.org/extra/clang-tidy/ (2013), accessed: 02-10-2026
- [4] Infer: A static analyzer for Java, C, C++, and Objective-C. https://fbinfer.com/ (2015), accessed: 02-12-2026
- [5] Bilge, L., Dumitraş, T.: Before we knew it: an empirical study of zero-day attacks in the real world. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS '12). pp. 833–844. Association for Computing Machinery, New York, NY, USA (2012). https://doi.org/10.1145/2382196.2382284
- [6] Böhme, M.: STADS: Software testing as species discovery. ACM Trans. Softw. Eng. Methodol. 27(2) (Jun 2018). https://doi.org/10.1145/3210309
- [7] Cheng, X., Zhang, G., Wang, H., Sui, Y.: Path-sensitive code embedding via contrastive learning for software vulnerability detection. In: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. pp. 519–531 (2022)
- [8] Croft, R., Newlands, D., Chen, Z., Babar, M.A.: An empirical study of rule-based and learning-based approaches for static application security testing. In: Proceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). pp. 1–12 (2021)
- [9] Dijkstra, E.W.: Notes on structured programming (1970). https://api.semanticscholar.org/CorpusID:8242220
- [10] Ding, Y., Fu, Y., Ibrahim, O., Sitawarin, C., Chen, X., Alomair, B., Wagner, D., Ray, B., Chen, Y.: Vulnerability detection with code language models: How far are we? In: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). pp. 1729–1741 (2025). https://doi.org/10.1109/ICSE55347.2025.00038
- [11] Ebrahim, F., Joy, M.: Source code plagiarism detection with pre-trained model embeddings and automated machine learning. In: Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing. pp. 301–309 (2023)
- [12] Falleri, J.R., Morandat, F., Blanc, X., Martinez, M., Monperrus, M.: Fine-grained and accurate source code differencing. In: Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering. pp. 313–324 (2014)
- [13] Farhad, M., Rahman, S., Dass, S.: Hydra: A hybrid heuristic-guided deep representation architecture for predicting latent zero-day vulnerabilities in patched functions. arXiv preprint arXiv:2511.06220 (2025)
- [14] Feng, Z., Guo, D., Tang, D., Duan, N., Feng, X., Gong, M., Shou, L., Qin, B., Liu, T., Jiang, D., Zhou, M.: CodeBERT: A pre-trained model for programming and natural languages (2020)
- [15] Fu, M., Tantithamthavorn, C., Le, T., Nguyen, V., Phung, D.: VulRepair: a T5-based automated software vulnerability repair. In: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2022). pp. 935–947. Association for Computing Machinery, New York, NY, USA (2022)
- [16] Guo, D., Lu, S., Duan, N., Wang, Y., Zhou, M., Yin, J.: UniXcoder: Unified cross-modal pre-training for code representation. arXiv preprint arXiv:2203.03850 (2022)
- [17] Guo, D., Ren, S., Lu, S., Feng, Z., Tang, D., Liu, S., Zhou, L., Duan, N., Svyatkovskiy, A., Fu, S., et al.: GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020)
- [18] Han, M., Wang, L., Chang, J., Li, B., Zhang, C.: Learning graph-based patch representations for identifying and assessing silent vulnerability fixes. In: 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE). pp. 120–131 (2024). https://doi.org/10.1109/ISSRE62328.2024.00022
- [19] Hugging Face, Inc.: Hugging Face model hub. https://huggingface.co/, accessed: 11-20-2025
- [20] Katz, K., Moshtari, S., Mujhid, I., Mirakhorli, M., Garcia, D.: Siexvults: Sensitive information exposure vulnerability detection system using transformer models and static analysis. In: 2025 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). pp. 230–241. IEEE (2025)
- [21] Kholoosi, M.M., Le, T.H.M., Babar, M.A.: Software vulnerability management in the era of artificial intelligence: An industry perspective. arXiv preprint arXiv:2512.18261 (2025)
- [22] Le, T.H.M., Babar, M.A.: Automatic data labeling for software vulnerability prediction models: How far are we? In: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. pp. 131–142 (2024)
- [23] Lee, S., Böhme, M.: Dependency-aware residual risk analysis. In: Proceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE). vol. 26 (2026)
- [24]
- [25] Nguyen, A.T., Le, T.H.M., Babar, M.A.: Automated code-centric software vulnerability assessment: How far are we? An empirical study in C/C++. In: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. pp. 72–83 (2024)
- [26] Nikiema, S.L., Djire, A.E., Bonkoungou, A.A., Moumoula, M.B., Samhi, J., Kabore, A.K., Klein, J., Bissyande, T.F.: How small transformation expose the weakness of semantic similarity measures. arXiv preprint arXiv:2509.09714 (2025)
- [27] Selvaraj, M., Uddin, G.: Does collaborative editing help mitigate security vulnerabilities in crowd-shared IoT code examples? In: Proceedings of the 16th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement. pp. 92–102 (2022)
- [28] Shi, E., Wang, Y., Du, L., Zhang, H., Han, S., Zhang, D., Sun, H.: CoCoAST: representing source code via hierarchical splitting and reconstruction of abstract syntax trees. Empirical Software Engineering 28(6), 135 (2023)
- [29] Song, Y., Lothritz, C., Tang, X., Bissyandé, T., Klein, J.: Revisiting code similarity evaluation with abstract syntax tree edit distance. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 38–46 (2024)
- [30]
- [31] Tang, X., Ezzini, S., Tian, H., Song, Y., Klein, J., Bissyande, T.F., et al.: Multilevel semantic embedding of software patches: a fine-to-coarse grained approach towards security patch detection. arXiv preprint arXiv:2308.15233 (2023)
- [32] Tree-sitter Contributors: Tree-sitter: A parser generator tool and incremental parsing library. https://tree-sitter.github.io/tree-sitter/ (2024), accessed: 01-21-2026
- [33] Wang, K., Yan, M., Zhang, H., Hu, H.: Unified abstract syntax tree representation learning for cross-language program classification. In: Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension. pp. 390–400 (2022)
- [34] Wang, S., Wen, M., Chen, L., Yi, X., Mao, X.: How different is it between machine-generated and developer-provided patches? An empirical study on the correct patches generated by automated program repair techniques. In: 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). pp. 1–12. IEEE (2019)
- [35] Wang, S., Wang, X., Sun, K., Jajodia, S., Wang, H., Li, Q.: GraphSPD: Graph-based security patch detection with enriched code semantics. In: 2023 IEEE Symposium on Security and Privacy (SP). pp. 2409–2426. IEEE Computer Society, Los Alamitos, CA, USA (May 2023). https://doi.org/10.1109/SP46215.2023.10179479
- [36] Wang, Y., Le, H., Gotmare, A.D., Bui, N.D., Li, J., Hoi, S.C.H.: CodeT5+: Open code large language models for code understanding and generation. arXiv preprint (2023)
- [37] Wang, Y., Wang, W., Joty, S., Hoi, S.C.H.: CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation (2021)
- [38] Weissberg, F., Pirch, L., Imgrund, E., Möller, J., Eisenhofer, T., Rieck, K.: LLM-based vulnerability discovery through the lens of code metrics. arXiv preprint arXiv:2509.19117 (2025)
- [39] Xie, Z., Wen, M., Wei, Z., Jin, H.: Unveiling the characteristics and impact of security patch evolution. In: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. pp. 1094–1106 (2024)
- [40] Ye, Z., Sun, X., Cao, S., Bo, L., Li, B.: Well begun is half done: Location-aware and trace-guided iterative automated vulnerability repair. arXiv preprint arXiv:2512.20203 (2025)
- [41] Yi, G., Nong, Y., Li, M., Cai, H.: Exploring and improving real-world vulnerability data generation via prompting large language models (2026)
- [42] Zio, E.: Challenges in the vulnerability and risk analysis of critical infrastructures. Reliability Engineering & System Safety 152, 137–150 (2016). https://doi.org/10.1016/j.ress.2016.02.009
discussion (0)