
arxiv: 2604.24398 · v1 · submitted 2026-04-27 · 💻 cs.CR · cs.SE

MAS-SZZ: Multi-Agentic SZZ Algorithm for Vulnerability-Inducing Commit Identification

Pith reviewed 2026-05-08 02:54 UTC · model grok-4.3

classification 💻 cs.CR cs.SE
keywords vulnerability-inducing commit · SZZ algorithm · multi-agent system · CVE description · patch hunk localization · root cause summarization · code history tracing · software security

The pith

MAS-SZZ uses multi-agent collaboration to summarize root causes and localize vulnerable statements for more accurate backtracking to vulnerability-inducing commits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to solve the problem of reliably identifying the earliest commit that introduced a software vulnerability, a task that underpins vulnerability detection, affected version analysis, and other security work. Prior SZZ variants, including V-SZZ and LLM4SZZ, often fail because they select poor anchors in the fixing patch and cannot trace backward effectively. MAS-SZZ instead deploys cooperating agents that first distill the vulnerability root cause from the CVE description and fixing commit, then apply structured step-forward prompting to mark vulnerability-related statements inside each patch hunk. These statements become the anchors from which the system traces the repository history to locate the introducing commit. Experiments across datasets and languages report F1-score improvements of up to 65.22 percent over the strongest baseline.

Core claim

MAS-SZZ identifies vulnerability-inducing commits by having agents collaborate on two steps: they summarize the root cause from the given CVE description and fixing commit, then use structured step-forward prompting to localize vulnerability-related statements from the change intent of each patch hunk; those statements serve as anchors for autonomous backward tracing through the repository history to the commit that first introduced the vulnerability.
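The control flow of this core claim can be sketched as follows. This is a minimal sketch, not the authors' implementation. The three agent roles are modeled as plain Python callables (`summarize`, `localize`, `trace`; all hypothetical names). In a real system these would wrap LLM calls, and the tracer would also need access to the repository history. The sketch shows only the data flow: one root-cause summary per CVE, anchors localized hunk by hunk, and a single backward trace from the collected anchors.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Anchor:
    """A vulnerability-related statement on the pre-fix side of the patch."""
    path: str
    line: int
    text: str


# Hypothetical agent roles. In a real system each would wrap an LLM call;
# the tracer would also need tool access to the repository history.
Summarizer = Callable[[str, str], str]            # (CVE description, fix diff) -> root cause
Localizer = Callable[[str, str], List[Anchor]]    # (hunk, root cause) -> anchors
Tracer = Callable[[List[Anchor]], Optional[str]]  # anchors -> inducing commit SHA


def mas_szz_pipeline(
    cve_description: str,
    fix_diff: str,
    hunks: List[str],
    summarize: Summarizer,
    localize: Localizer,
    trace: Tracer,
) -> Optional[str]:
    # Step 1: one root-cause summary per CVE, shared by all hunks.
    root_cause = summarize(cve_description, fix_diff)

    # Step 2: localize vulnerability-related statements hunk by hunk.
    anchors: List[Anchor] = []
    for hunk in hunks:
        anchors.extend(localize(hunk, root_cause))

    # Step 3: trace backward from the anchors; give up if none were found.
    if not anchors:
        return None
    return trace(anchors)
```

Passing the agents in as arguments keeps the orchestration testable with stubs. For example, a `localize` stub that returns an anchor only for hunks mentioning `memcpy` makes the pipeline return whatever `trace` reports. When no hunk yields an anchor, the pipeline returns `None`.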

What carries the argument

Multi-agent system that summarizes the root cause and applies structured step-forward prompting to localize vulnerability-related statements from patch hunks, which then act as anchors for historical backtracking.
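Both classic SZZ and the anchor-localization step start from the hunks of the fixing commit. The sketch below illustrates that raw material, with hedges. It splits a single-file unified diff into hunks and records the pre-fix line number of each deleted line; these are the lines classic SZZ would blame. In MAS-SZZ's design, the agents would instead select vulnerability-related statements within each hunk. The `Hunk` type and `parse_hunks` function are hypothetical helpers, not part of the paper.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Matches "@@ -old_start[,old_count] +new_start[,new_count] @@"; omitted counts mean 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class Hunk:
    old_start: int
    deleted: List[Tuple[int, str]] = field(default_factory=list)  # (pre-fix line no., text)
    added: List[str] = field(default_factory=list)


def parse_hunks(diff_text: str) -> List[Hunk]:
    """Split a single-file unified diff into hunks."""
    hunks: List[Hunk] = []
    current = None
    old_left = new_left = 0
    old_lineno = 0
    for line in diff_text.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            old_lineno = int(header.group(1))
            old_left = int(header.group(2) or 1)
            new_left = int(header.group(4) or 1)
            current = Hunk(old_start=old_lineno)
            hunks.append(current)
            continue
        # Skip file headers ("---"/"+++") and anything after a hunk is complete.
        if current is None or (old_left <= 0 and new_left <= 0):
            continue
        if line.startswith("\\"):
            continue  # "\ No newline at end of file"
        if line.startswith("-"):
            current.deleted.append((old_lineno, line[1:]))
            old_lineno += 1
            old_left -= 1
        elif line.startswith("+"):
            current.added.append(line[1:])
            new_left -= 1
        else:
            # Context line: present on both sides of the diff.
            old_lineno += 1
            old_left -= 1
            new_left -= 1
    return hunks
```

Consider a hunk headed `@@ -10,4 +10,5 @@` that replaces `memcpy(buf, src, len + 1);` on old line 11 with a bounds check plus a new `memcpy`. For that input, `parse_hunks` returns one hunk whose `deleted` list is `[(11, "    memcpy(buf, src, len + 1);")]`.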

If this is right

  • Supplies a stronger foundation for downstream security tasks such as vulnerability detection and affected-version analysis.
  • Delivers F1-score gains of up to 65.22 percent over the best-performing prior SZZ algorithm across multiple datasets and languages.
  • Addresses the specific failures of incorrect anchor selection and inadequate backtracking that limited V-SZZ and LLM4SZZ.
  • Enables autonomous tracing from patch-derived anchors without manual intervention.
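To make the headline metric concrete, the sketch below computes precision, recall, and F1 for predicted versus ground-truth inducing commits, plus a relative gain. The review does not say how F1 is aggregated, or whether the 65.22 percent figure is a relative gain or an absolute difference in percentage points. This sketch assumes micro-averaging over (CVE, commit) pairs and a relative gain; both are assumptions. The CVE keys and SHAs in the usage note are placeholders, not data from the paper.

```python
from typing import Dict, Set, Tuple


def prf1(predictions: Dict[str, Set[str]],
         ground_truth: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over (CVE, commit) pairs."""
    tp = fp = fn = 0
    for cve in set(predictions) | set(ground_truth):
        pred = predictions.get(cve, set())
        gold = ground_truth.get(cve, set())
        tp += len(pred & gold)   # correctly identified inducing commits
        fp += len(pred - gold)   # reported commits that are not inducing
        fn += len(gold - pred)   # inducing commits that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0
```

Take ground truth `{"CVE-A": {"b2"}, "CVE-B": {"x9"}, "CVE-C": {"k1", "k2"}}` and predictions `{"CVE-A": {"b2"}, "CVE-B": {"y7"}, "CVE-C": {"k1"}}`. This gives 2 true positives, 1 false positive, and 2 false negatives. Precision is therefore 2/3, recall is 1/2, and F1 is 4/7 (about 0.571).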

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same agent-collaboration pattern could be tested on identifying the origins of non-vulnerability bugs or other code-change types.
  • Performance may vary on vulnerabilities lacking detailed CVE descriptions, since the method depends on those descriptions for root-cause summarization.
  • Replacing the current agents with larger or fine-tuned models might increase localization accuracy but would require separate validation.
  • The approach could be combined with static-analysis tools to cross-check the localized statements before backtracking begins.

Load-bearing premise

The multi-agent system can reliably summarize the root cause from the CVE and fixing commit and correctly localize vulnerability-related statements from patch hunks without introducing errors that invalidate the subsequent backtracking.

What would settle it

A hand-labeled sample of CVEs in which the agents' summarized root cause or localized statements are shown to be incorrect, or a new dataset on which MAS-SZZ fails to produce F1 gains over the best prior SZZ method.

Figures

Figures reproduced from arXiv: 2604.24398 by Fu Xiao, Jing Yang, Jinxuan Xu, Le Yu, Linlin Zhu, Sicong Cao, Xingwei Lin.

Figure 1: Overview of MAS-SZZ.
Figure 2: A running example detailing the full workflow of …
Original abstract

Accurate vulnerability-inducing commit identification serves as a foundation for a series of software security tasks, such as vulnerability detection and affected version analysis. A straightforward solution is the SZZ algorithm, which traces back through the code history to identify the earliest commit that modified the vulnerable code. Unfortunately, neither the customized V-SZZ nor the state-of-the-art LLM4SZZ performs satisfactorily, owing to incorrect anchor selection and inadequate backtracking capability, which leaves both far from reliable in practice. To overcome these challenges, we propose a multi-agentic SZZ algorithm, named MAS-SZZ, that facilitates the identification of vulnerability-inducing commits through collaboration among agents. Specifically, given a CVE description and its corresponding fixing commit, MAS-SZZ summarizes the root cause of the vulnerability and employs a structured step-forward prompting strategy to localize vulnerability-related statements based on the change intent of each patch hunk. These vulnerable statements serve as anchors from which MAS-SZZ autonomously traces backward through the repository's history to find the commit that first introduced the vulnerability. Extensive experiments show that MAS-SZZ outperforms the state-of-the-art baselines across datasets and programming languages, achieving F1-score gains of up to 65.22% over the best-performing SZZ algorithm.
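The backward tracing that SZZ performs can be illustrated on a toy history. The sketch below is an assumption-laden model, not V-SZZ, LLM4SZZ, or MAS-SZZ, and all names in it are hypothetical. Each commit is given as a mapping from a stable line identifier to that line's text; a real tool would derive this mapping from diffs and `git blame`. `blame_commit` returns the most recent commit that touched an anchor line, as classic SZZ does. `earliest_commit` keeps tracing past later edits to the first such commit, following the abstract's description of finding the earliest commit that modified the vulnerable code. The final helper takes the earliest result across several anchors; that aggregation rule is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Commit:
    sha: str
    # Stable line identifier -> text of that line after this commit.
    # Real tools would derive this mapping from diffs and `git blame`.
    lines: Dict[str, str]


def modification_chain(history: List[Commit], line_id: str) -> List[str]:
    """SHAs of commits that added or changed `line_id`, oldest first.

    `history` is chronological and ends with the parent of the fixing commit.
    """
    chain: List[str] = []
    previous: Optional[str] = None
    for commit in history:
        text = commit.lines.get(line_id)
        if text is not None and text != previous:
            chain.append(commit.sha)
        previous = text
    return chain


def blame_commit(history: List[Commit], line_id: str) -> Optional[str]:
    """Classic SZZ: the most recent commit that touched the anchor line."""
    chain = modification_chain(history, line_id)
    return chain[-1] if chain else None


def earliest_commit(history: List[Commit], line_id: str) -> Optional[str]:
    """Keep tracing back past later edits to the first commit that touched the line."""
    chain = modification_chain(history, line_id)
    return chain[0] if chain else None


def inducing_commit(history: List[Commit], anchors: List[str]) -> Optional[str]:
    """Assumed aggregation rule: the earliest result across all anchors."""
    order = {c.sha: i for i, c in enumerate(history)}
    found = [sha for sha in (earliest_commit(history, a) for a in anchors) if sha]
    return min(found, key=order.__getitem__) if found else None


# Toy history (placeholder SHAs), oldest first.
TOY_HISTORY = [
    Commit("a1", {"inc": "#include <string.h>"}),
    Commit("b2", {"inc": "#include <string.h>", "cpy": "memcpy(buf, src, len);"}),
    Commit("c3", {"inc": "#include <string.h>", "cpy": "memcpy(buf, src, len);",
                  "log": "log(len);"}),
    Commit("d4", {"inc": "#include <string.h>", "cpy": "memcpy(buf, src, n);",
                  "log": "log(len);"}),
]
```

In `TOY_HISTORY`, `b2` adds the `memcpy` line and `d4` later edits it. `blame_commit` on that line therefore returns `d4`, while `earliest_commit` returns `b2`. If the unrelated `#include` line added in `a1` is also passed as an anchor, `inducing_commit` moves back to `a1`. This is a small illustration of why anchor selection matters.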

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MAS-SZZ, a multi-agentic SZZ algorithm for identifying vulnerability-inducing commits. Given a CVE description and fixing commit, agents summarize the root cause and apply structured step-forward prompting to localize vulnerability-related statements within patch hunks; these statements serve as anchors for autonomous backward tracing through repository history. The central claim is that MAS-SZZ outperforms prior SZZ variants (including V-SZZ and LLM4SZZ) across datasets and languages, with F1-score gains reaching 65.22%.

Significance. If the empirical results are robust, the work offers a meaningful improvement to a foundational technique in software security. Accurate vulnerability-inducing commit identification supports downstream tasks such as vulnerability detection and affected-version analysis. The multi-agent pipeline directly targets documented failure modes of anchor selection and backtracking in existing SZZ implementations, and the reported gains suggest practical impact if the localization step proves reliable.

major comments (2)
  1. [Abstract] Abstract: the performance claim of up to 65.22% F1 improvement is presented without any description of experimental design, baseline re-implementations, dataset construction criteria, statistical tests, or controls for selection effects. This absence prevents verification of the central empirical result.
  2. [Methods] Methods (localization and summarization pipeline): the approach rests on the premise that the multi-agent system produces accurate root-cause summaries and correctly identifies vulnerability-related statements in patch hunks. No ablation, error analysis, or human validation of localization accuracy is referenced, leaving open the possibility that downstream backtracking errors arise from this step.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'structured step-forward prompting strategy' is introduced without a brief illustrative example or pseudocode, reducing immediate clarity for readers unfamiliar with the technique.
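The sketch below offers the kind of illustration this minor comment asks for, under one possible reading of "structured step-forward prompting". It assumes a fixed, ordered list of steps that the model answers in sequence (the hunk's change intent, then its relation to the root cause, then the concrete statements), followed by a structured JSON reply. The step wording, the JSON schema, and the helper names are all hypothetical; the paper's actual prompts are not shown in this review.

```python
import json
from typing import List

# Hypothetical step list: one possible reading of "structured step-forward
# prompting", in which the model answers each step before the next one.
STEPS = [
    "Describe the change intent of this hunk in one sentence.",
    "Explain how that intent relates to the vulnerability root cause.",
    "List the pre-fix line numbers of statements that realize the root cause.",
]


def build_prompt(root_cause: str, hunk: str) -> str:
    steps = "\n".join(f"Step {i}: {s}" for i, s in enumerate(STEPS, start=1))
    return (
        f"Vulnerability root cause:\n{root_cause}\n\n"
        f"Patch hunk:\n{hunk}\n\n"
        "Answer the steps below in order; later steps may use earlier answers.\n"
        f"{steps}\n\n"
        'Reply as JSON: {"intent": str, "relation": str, "statements": [int]}'
    )


def parse_reply(reply: str) -> List[int]:
    """Extract localized line numbers, discarding malformed entries."""
    data = json.loads(reply)
    statements = data.get("statements", [])
    return [n for n in statements if isinstance(n, int) and not isinstance(n, bool)]
```

`parse_reply` drops non-integer entries, so a reply containing `"statements": [11, "12", 13]` yields `[11, 13]`. Rejecting malformed anchors at this point keeps them out of the backtracking step.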

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the performance claim of up to 65.22% F1 improvement is presented without any description of experimental design, baseline re-implementations, dataset construction criteria, statistical tests, or controls for selection effects. This absence prevents verification of the central empirical result.

    Authors: We agree that the abstract, constrained by length, omits key experimental details. The full manuscript (Sections 4 and 5) describes the datasets (CVE-linked fixing commits across languages), baseline re-implementations (V-SZZ and LLM4SZZ), evaluation using precision/recall/F1, and dataset construction from public vulnerability repositories. To address the concern, we will revise the abstract to concisely note the evaluation setup, datasets, and metrics, while retaining the high-level claim. We will also add a brief reference to statistical significance testing in the experiments section for completeness. revision: yes

  2. Referee: [Methods] Methods (localization and summarization pipeline): the approach rests on the premise that the multi-agent system produces accurate root-cause summaries and correctly identifies vulnerability-related statements in patch hunks. No ablation, error analysis, or human validation of localization accuracy is referenced, leaving open the possibility that downstream backtracking errors arise from this step.

    Authors: We acknowledge the importance of validating the localization and summarization components. The manuscript details the multi-agent pipeline and structured prompting in Section 3, but does not include dedicated ablation studies, error analysis, or human validation specifically for root-cause summary accuracy and statement localization. In the revised version, we will add an ablation study isolating the localization step, an error analysis of failure cases, and a human evaluation on a subset of samples to quantify the accuracy of the anchors used for backtracking. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

Rationale

The paper proposes MAS-SZZ as a novel multi-agent pipeline that takes CVE descriptions and fixing commits as inputs, summarizes root causes, localizes vulnerable statements via structured step-forward prompting on patch hunks, and then performs autonomous history backtracking to identify inducing commits. Evaluation consists of direct comparisons against external baselines (V-SZZ, LLM4SZZ) on separate datasets across languages, reporting F1 gains. No equations, fitted parameters, self-definitional reductions, or load-bearing self-citations appear in the provided text; the central claims rest on the independent algorithmic construction and external empirical results rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on the unverified assumption that LLM agents can accurately perform root-cause summarization and statement localization; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5532 in / 1094 out tokens · 100773 ms · 2026-05-08T02:54:27.929586+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

  [1] Lingfeng Bao, Xin Xia, Ahmed E. Hassan, and Xiaohu Yang. 2022. V-SZZ: Automatic Identification of Version Ranges Affected by CVE Vulnerabilities. In Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE). 2352–2364.

  [2] Xingchu Chen, Chengwei Liu, Jialun Cao, Yang Xiao, Xinyue Cai, Yeting Li, Jingyi Shi, Tianqi Sun, Haiming Chen, and Wei Huo. 2025. Vulnerability-Affected Versions Identification: How Far Are We? In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). 2970–2982.

  [3] Daniel Alencar da Costa, Shane McIntosh, Weiyi Shang, Uirá Kulesza, Roberta Coelho, and Ahmed E. Hassan. 2017. A Framework for Evaluating the Results of the SZZ Approach for Identifying Bug-Introducing Changes. IEEE Trans. Software Eng. 43, 7 (2017), 641–657.

  [4] Steven Davies, Marc Roper, and Murray Wood. 2014. Comparing text-based and dependence-based approaches for determining the origins of bugs. J. Softw. Evol. Process. 26, 1 (2014), 107–139.

  [5] Kim Herzig, Sascha Just, and Andreas Zeller. 2016. The Impact of Tangled Code Changes on Defect Prediction Models. Empir. Softw. Eng. 21, 2 (2016), 303–336.

  [6] Torge Hinrichs, Emanuele Iannone, Tamás Aladics, Péter Hegedűs, Andrea De Lucia, Fabio Palomba, and Riccardo Scandariato. 2026. Back to the Roots: Assessing Mining Techniques for Java Vulnerability-Contributing Commits. ACM Trans. Softw. Eng. Methodol. (2026).

  [7] Sunghun Kim, Thomas Zimmermann, Kai Pan, and E. James Whitehead Jr. 2006. Automatic Identification of Bug-Introducing Changes. In Proceedings of the 21st IEEE/ACM International Conference on Automated Software Engineering (ASE). 81–90.

  [8] Yi Li, Aashish Yadavally, Jiaxing Zhang, Shaohua Wang, and Tien N. Nguyen. 2023. Commit-Level, Neural Vulnerability Detection and Assessment. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). 1024–1036.

  [9] Yunbo Lyu, Hong Jin Kang, Ratnadira Widyasari, Julia Lawall, and David Lo. 2024. Evaluating SZZ Implementations: An Empirical Study on the Linux Kernel. IEEE Trans. Software Eng. 50, 9 (2024), 2219–2239.

  [10] Viet Hung Nguyen, Stanislav Dashevskyi, and Fabio Massacci. 2016. An automatic method for assessing the versions affected by a vulnerability. Empirical Software Engineering 21, 6 (2016), 2268–2297.

  [11] Christophe Rezk, Yasutaka Kamei, and Shane McIntosh. 2022. The Ghost Commit Problem When Identifying Fix-Inducing Changes: An Empirical Study of Apache Projects. IEEE Trans. Software Eng. 48, 9 (2022), 3297–3309.

  [12] Giovanni Rosa, Luca Pascarella, Simone Scalabrino, Rosalia Tufano, Gabriele Bavota, Michele Lanza, and Rocco Oliveto. 2021. Evaluating SZZ Implementations Through a Developer-informed Oracle. In Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering (ICSE). 436–447.

  [13] Jacek Sliwerski, Thomas Zimmermann, and Andreas Zeller. 2005. When do changes induce fixes? ACM SIGSOFT Softw. Eng. Notes 30, 4 (2005), 1–5.

  [14] Shiyu Sun, Yunlong Xing, Xinda Wang, Shu Wang, Qi Li, and Kun Sun. 2025. DISPATCH: Unraveling Security Patches from Entangled Code Changes. In Proceedings of the 34th USENIX Security Symposium (Security). 4521–4540.

  [15] Xiaobing Sun, Mingxuan Zhou, Sicong Cao, Xiaoxue Wu, Lili Bo, Di Wu, Bin Li, and Yang Xiang. 2025. HgtJIT: Just-in-Time Vulnerability Detection Based on Heterogeneous Graph Transformer. IEEE Trans. Dependable Secur. Comput. 22, 6 (2025), 6522–6538.

  [16] Lingxiao Tang, Jiakun Liu, Zhongxin Liu, Xiaohu Yang, and Lingfeng Bao. 2025. LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models. Proc. ACM Softw. Eng. 2, ISSTA (2025), 343–365.

  [17] Wei Tao, Yucheng Zhou, Yanlin Wang, Wenqiang Zhang, Hongyu Zhang, and Yu Cheng. 2024. MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS). 51963–51993.

  [18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS). 24824–24837.

  [19] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Proceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS). 50528–50652.

  [20] Songtao Yang, Yubo He, Kaixiang Chen, Zheyu Ma, Xiapu Luo, Yong Xie, Jianjun Chen, and Chao Zhang. 2023. 1dFuzz: Reproduce 1-Day Vulnerabilities with Directed Differential Fuzzing. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 867–879.

  [21] Qunhong Zeng, Yuxia Zhang, Zhiqing Qiu, and Hui Liu. 2025. A First Look at Conventional Commits Classification. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). 2277–2289.

  [22] Jian Zhang, Chong Wang, Anran Li, Weisong Sun, Cen Zhang, Wei Ma, and Yang Liu. 2026. Evaluating Large Language Models for Line-Level Vulnerability Localization. IEEE Trans. Software Eng. 52, 3 (2026), 770–785.

  [23] Xin Zhou, Sicong Cao, Xiaobing Sun, and David Lo. 2025. Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead. ACM Trans. Softw. Eng. Methodol. 34, 5 (2025), 145:1–145:31.

  [24] Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Mingyue Leng, and Xiaoguang Mao. 2026. Atomizer: An LLM-based Collaborative Multi-Agent Framework for Intent-Driven Commit Untangling. In Proceedings of the 48th IEEE/ACM International Conference on Software Engineering (ICSE).