MAS-SZZ: Multi-Agentic SZZ Algorithm for Vulnerability-Inducing Commit Identification
Pith reviewed 2026-05-08 02:54 UTC · model grok-4.3
The pith
MAS-SZZ uses multi-agent collaboration to summarize root causes and localize vulnerable statements for more accurate backtracking to vulnerability-inducing commits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAS-SZZ identifies vulnerability-inducing commits by having agents collaborate on two steps: they summarize the root cause from the given CVE description and fixing commit, then use structured step-forward prompting to localize vulnerability-related statements from the change intent of each patch hunk; those statements serve as anchors for autonomous backward tracing through the repository history to the commit that first introduced the vulnerability.
What carries the argument
Multi-agent system that summarizes the root cause and applies structured step-forward prompting to localize vulnerability-related statements from patch hunks, which then act as anchors for historical backtracking.
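The backtracking half of this machinery can be illustrated with a toy model. The sketch below is not the paper's implementation: it assumes an in-memory history (oldest commit first, ending at the parent of the fixing commit), matches anchors by exact stripped-line equality, and uses hypothetical names (`Commit`, `trace_anchor`). A real SZZ variant would work on a Git repository and track lines across edits and renames.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Commit:
    sha: str                      # hypothetical commit identifier
    files: Dict[str, List[str]]   # path -> file lines after this commit


def trace_anchor(history: List[Commit], path: str, anchor: str) -> Optional[str]:
    """Return the earliest commit in the unbroken run of commits, ending at
    the newest one, whose version of `path` contains the anchor statement.

    `history` is ordered oldest to newest and should end at the parent of
    the fixing commit. Returns None if the newest commit lacks the anchor.
    """
    target = anchor.strip()
    inducing = None
    for commit in reversed(history):
        lines = commit.files.get(path, [])
        if target not in (line.strip() for line in lines):
            break  # the anchor did not exist yet; stop walking backward
        inducing = commit.sha
    return inducing


if __name__ == "__main__":
    demo = [
        Commit("c1", {"parse.c": ["int n = read_len();"]}),
        Commit("c2", {"parse.c": ["int n = read_len();", "memcpy(buf, src, n);"]}),
        Commit("c3", {"parse.c": ["int n = read_len();", "memcpy(buf, src, n);", "log(n);"]}),
    ]
    print(trace_anchor(demo, "parse.c", "memcpy(buf, src, n);"))
```

The sketch makes the dependency on anchor quality visible: if the localized statement is wrong, the walk stops at an unrelated commit or returns nothing at all.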
If this is right
- Supplies a stronger foundation for downstream security tasks such as vulnerability detection and affected-version analysis.
- Delivers F1-score gains of up to 65.22% over the best-performing prior SZZ algorithm across multiple datasets and languages.
- Addresses the specific failures of incorrect anchor selection and inadequate backtracking that limited V-SZZ and LLM4SZZ.
- Enables autonomous tracing from patch-derived anchors without manual intervention.
Where Pith is reading between the lines
- The same agent-collaboration pattern could be tested on identifying the origins of non-vulnerability bugs or other code-change types.
- Performance may vary on vulnerabilities lacking detailed CVE descriptions, since the method depends on those descriptions for root-cause summarization.
- Replacing the current agents with larger or fine-tuned models might increase localization accuracy but would require separate validation.
- The approach could be combined with static-analysis tools to cross-check the localized statements before backtracking begins.
Load-bearing premise
The multi-agent system can reliably summarize the root cause from the CVE and fixing commit and correctly localize vulnerability-related statements from patch hunks without introducing errors that invalidate the subsequent backtracking.
What would settle it
A hand-labeled sample of CVEs in which the agents' summarized root cause or localized statements are shown to be incorrect, or a new dataset on which MAS-SZZ fails to produce F1 gains over the best prior SZZ method.
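A check of this kind could be scored as in the following sketch, which compares predicted inducing commits against hand labels. It is only an illustration under stated assumptions: counts are micro-averaged over all CVEs, the function name `prf1` and the identifiers in the example are hypothetical, and the paper's own evaluation protocol may differ.

```python
from typing import Dict, Set, Tuple


def prf1(predicted: Dict[str, Set[str]],
         labeled: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over labeled CVEs.

    Both arguments map a CVE identifier to a set of commit hashes.
    A CVE with no prediction counts all of its labeled commits as misses.
    """
    tp = fp = fn = 0
    for cve, truth in labeled.items():
        pred = predicted.get(cve, set())
        tp += len(pred & truth)   # correctly identified inducing commits
        fp += len(pred - truth)   # predicted but not labeled
        fn += len(truth - pred)   # labeled but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Placeholder identifiers, not real CVEs or commits.
    labeled = {"CVE-A": {"a1"}, "CVE-B": {"b1", "b2"}}
    predicted = {"CVE-A": {"a1"}, "CVE-B": {"b1", "x9"}}
    print(prf1(predicted, labeled))
```

Running the same function on the output of MAS-SZZ and of the best prior SZZ method, over the same labeled sample, would give the two F1 values whose difference the criterion above turns on.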
Original abstract
Accurate vulnerability-inducing commit identification serves as a foundation for a series of software security tasks, such as vulnerability detection and affected version analysis. A straightforward solution is the SZZ algorithm, which traces back through the code history to identify the earliest commit that modifies the vulnerable code. Unfortunately, neither the customized V-SZZ nor the state-of-the-art LLM4SZZ performs satisfactorily, owing to incorrect anchor selection and inadequate backtracking capability, which leaves them far from reliable in practice. To overcome these challenges, we propose a multi-agentic SZZ algorithm, named MAS-SZZ, that facilitates the identification of vulnerability-inducing commits through collaboration among agents. Specifically, given a CVE description and its corresponding fixing commit, MAS-SZZ summarizes the root cause of the vulnerability and employs a structured step-forward prompting strategy to localize vulnerability-related statements based on the change intent of each patch hunk. These vulnerable statements serve as anchors from which MAS-SZZ autonomously traces backward through the repository's history to find the commit that first introduced the vulnerability. Extensive experiments show that MAS-SZZ outperforms the state-of-the-art baselines across datasets and programming languages, achieving F1-score gains of up to 65.22% over the best-performing SZZ algorithm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MAS-SZZ, a multi-agentic SZZ algorithm for identifying vulnerability-inducing commits. Given a CVE description and fixing commit, agents summarize the root cause and apply structured step-forward prompting to localize vulnerability-related statements within patch hunks; these statements serve as anchors for autonomous backward tracing through repository history. The central claim is that MAS-SZZ outperforms prior SZZ variants (including V-SZZ and LLM4SZZ) across datasets and languages, with F1-score gains reaching 65.22%.
Significance. If the empirical results are robust, the work offers a meaningful improvement to a foundational technique in software security. Accurate vulnerability-inducing commit identification supports downstream tasks such as vulnerability detection and affected-version analysis. The multi-agent pipeline directly targets documented failure modes of anchor selection and backtracking in existing SZZ implementations, and the reported gains suggest practical impact if the localization step proves reliable.
major comments (2)
- [Abstract] Abstract: the performance claim of up to 65.22% F1 improvement is presented without any description of experimental design, baseline re-implementations, dataset construction criteria, statistical tests, or controls for selection effects. This absence prevents verification of the central empirical result.
- [Methods] Methods (localization and summarization pipeline): the approach rests on the premise that the multi-agent system produces accurate root-cause summaries and correctly identifies vulnerability-related statements in patch hunks. No ablation, error analysis, or human validation of localization accuracy is referenced, leaving open the possibility that downstream backtracking errors arise from this step.
minor comments (1)
- [Abstract] Abstract: the phrase 'structured step-forward prompting strategy' is introduced without a brief illustrative example or pseudocode, reducing immediate clarity for readers unfamiliar with the technique.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
- Referee: [Abstract] Abstract: the performance claim of up to 65.22% F1 improvement is presented without any description of experimental design, baseline re-implementations, dataset construction criteria, statistical tests, or controls for selection effects. This absence prevents verification of the central empirical result.
  Authors: We agree that the abstract, constrained by length, omits key experimental details. The full manuscript (Sections 4 and 5) describes the datasets (CVE-linked fixing commits across languages), baseline re-implementations (V-SZZ and LLM4SZZ), evaluation using precision/recall/F1, and dataset construction from public vulnerability repositories. To address the concern, we will revise the abstract to concisely note the evaluation setup, datasets, and metrics, while retaining the high-level claim. We will also add a brief reference to statistical significance testing in the experiments section for completeness.
  Revision: yes
- Referee: [Methods] Methods (localization and summarization pipeline): the approach rests on the premise that the multi-agent system produces accurate root-cause summaries and correctly identifies vulnerability-related statements in patch hunks. No ablation, error analysis, or human validation of localization accuracy is referenced, leaving open the possibility that downstream backtracking errors arise from this step.
  Authors: We acknowledge the importance of validating the localization and summarization components. The manuscript details the multi-agent pipeline and structured prompting in Section 3, but does not include dedicated ablation studies, error analysis, or human validation specifically for root-cause summary accuracy and statement localization. In the revised version, we will add an ablation study isolating the localization step, an error analysis of failure cases, and a human evaluation on a subset of samples to quantify the accuracy of the anchors used for backtracking.
  Revision: yes
Circularity Check
No significant circularity identified
The paper proposes MAS-SZZ as a novel multi-agent pipeline that takes CVE descriptions and fixing commits as inputs, summarizes root causes, localizes vulnerable statements via structured step-forward prompting on patch hunks, and then performs autonomous history backtracking to identify inducing commits. Evaluation consists of direct comparisons against external baselines (V-SZZ, LLM4SZZ) on separate datasets across languages, reporting F1 gains. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the provided text; the central claims rest on the independent algorithmic construction and external empirical results rather than reducing to inputs by construction.
Reference graph
Works this paper leans on
- [1] Lingfeng Bao, Xin Xia, Ahmed E. Hassan, and Xiaohu Yang. 2022. V-SZZ: Automatic Identification of Version Ranges Affected by CVE Vulnerabilities. In Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE). 2352–2364.
- [2] Xingchu Chen, Chengwei Liu, Jialun Cao, Yang Xiao, Xinyue Cai, Yeting Li, Jingyi Shi, Tianqi Sun, Haiming Chen, and Wei Huo. 2025. Vulnerability-Affected Versions Identification: How Far Are We? In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 2970–2982.
- [3] Daniel Alencar da Costa, Shane McIntosh, Weiyi Shang, Uirá Kulesza, Roberta Coelho, and Ahmed E. Hassan. 2017. A Framework for Evaluating the Results of the SZZ Approach for Identifying Bug-Introducing Changes. IEEE Trans. Software Eng. 43, 7 (2017), 641–657.
- [4] Steven Davies, Marc Roper, and Murray Wood. 2014. Comparing text-based and dependence-based approaches for determining the origins of bugs. J. Softw. Evol. Process. 26, 1 (2014), 107–139.
- [5] Kim Herzig, Sascha Just, and Andreas Zeller. 2016. The Impact of Tangled Code Changes on Defect Prediction Models. Empir. Softw. Eng. 21, 2 (2016), 303–336.
- [6] Torge Hinrichs, Emanuele Iannone, Tamás Aladics, Péter Hegedűs, Andrea De Lucia, Fabio Palomba, and Riccardo Scandariato. 2026. Back to the Roots: Assessing Mining Techniques for Java Vulnerability-Contributing Commits. ACM Trans. Softw. Eng. Methodol. (2026).
- [7] Sunghun Kim, Thomas Zimmermann, Kai Pan, and E. James Whitehead Jr. 2006. Automatic Identification of Bug-Introducing Changes. In Proceedings of the 21st IEEE/ACM International Conference on Automated Software Engineering (ASE). 81–90.
- [8] Yi Li, Aashish Yadavally, Jiaxing Zhang, Shaohua Wang, and Tien N. Nguyen. 2023. Commit-Level, Neural Vulnerability Detection and Assessment. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). 1024–1036.
- [9] Yunbo Lyu, Hong Jin Kang, Ratnadira Widyasari, Julia Lawall, and David Lo. 2024. Evaluating SZZ Implementations: An Empirical Study on the Linux Kernel. IEEE Trans. Software Eng. 50, 9 (2024), 2219–2239.
- [10] Viet Hung Nguyen, Stanislav Dashevskyi, and Fabio Massacci. 2016. An automatic method for assessing the versions affected by a vulnerability. Empirical Software Engineering 21, 6 (2016), 2268–2297.
- [11] Christophe Rezk, Yasutaka Kamei, and Shane McIntosh. 2022. The Ghost Commit Problem When Identifying Fix-Inducing Changes: An Empirical Study of Apache Projects. IEEE Trans. Software Eng. 48, 9 (2022), 3297–3309.
- [12] Giovanni Rosa, Luca Pascarella, Simone Scalabrino, Rosalia Tufano, Gabriele Bavota, Michele Lanza, and Rocco Oliveto. 2021. Evaluating SZZ Implementations Through a Developer-informed Oracle. In Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering (ICSE). 436–447.
- [13] Jacek Sliwerski, Thomas Zimmermann, and Andreas Zeller. 2005. When do changes induce fixes? ACM SIGSOFT Softw. Eng. Notes 30, 4 (2005), 1–5.
- [14] Shiyu Sun, Yunlong Xing, Xinda Wang, Shu Wang, Qi Li, and Kun Sun. 2025. DISPATCH: Unraveling Security Patches from Entangled Code Changes. In Proceedings of the 34th USENIX Security Symposium (Security). 4521–4540.
- [15] Xiaobing Sun, Mingxuan Zhou, Sicong Cao, Xiaoxue Wu, Lili Bo, Di Wu, Bin Li, and Yang Xiang. 2025. HgtJIT: Just-in-Time Vulnerability Detection Based on Heterogeneous Graph Transformer. IEEE Trans. Dependable Secur. Comput. 22, 6 (2025), 6522–6538.
- [16] Lingxiao Tang, Jiakun Liu, Zhongxin Liu, Xiaohu Yang, and Lingfeng Bao. 2025. LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models. Proc. ACM Softw. Eng. 2, ISSTA (2025), 343–365.
- [17] Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, and Yu Cheng. 2024. MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS). 51963–51993.
- [18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS). 24824–24837.
- [19] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS). 50528–50652.
- [20] Songtao Yang, Yubo He, Kaixiang Chen, Zheyu Ma, Xiapu Luo, Yong Xie, Jianjun Chen, and Chao Zhang. 2023. 1dFuzz: Reproduce 1-Day Vulnerabilities with Directed Differential Fuzzing. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 867–879.
- [21] Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, and Hui Liu. 2025. A First Look at Conventional Commits Classification. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). 2277–2289.
- [22] Jian Zhang, Chong Wang, Anran Li, Weisong Sun, Cen Zhang, Wei Ma, and Yang Liu. 2026. Evaluating Large Language Models for Line-Level Vulnerability Localization. IEEE Trans. Software Eng. 52, 3 (2026), 770–785.
- [23] Xin Zhou, Sicong Cao, Xiaobing Sun, and David Lo. 2025. Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead. ACM Trans. Softw. Eng. Methodol. 34, 5 (2025), 145:1–145:31.
- [24] Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Mingyue Leng, and Xiaoguang Mao. 2026. Atomizer: An LLM-based Collaborative Multi-Agent Framework for Intent-Driven Commit Untangling. In Proceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE).