arxiv: 2607.01760 · v1 · pith:NTNKNDUZnew · submitted 2026-07-02 · 💻 cs.SE

Refploit: Facilitating Exploit Construction via Code-Agent Trajectory Repair

Pith reviewed 2026-07-03 09:06 UTC · model grok-4.3

classification 💻 cs.SE
keywords: exploit reproduction · code agents · trajectory repair · Java vulnerabilities · differential execution · LLM-based framework · PoC generation · vulnerability assessment

The pith

Refploit repairs failed code-agent trajectories via differential execution validation to reproduce 80.2 percent of Java vulnerability exploits from incomplete public references.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Refploit, a framework that turns incomplete or partial exploit references into runnable exploits by repairing the trajectories produced by code agents. It runs each candidate exploit on both the vulnerable and the patched version of a library to check whether the vulnerable logic was actually reached. When the check fails, the system identifies the subtasks already completed in the trajectory, marks misleading paths to avoid, and generates new constraints that steer the agent toward completion. Across 172 references for 143 Java vulnerabilities, this raises the reproduction rate to 80.2 percent, a 64.3 percent relative improvement over the agents' initial outputs, and exceeds prior exploit-generation tools.

Core claim

Refploit is an LLM-based trajectory recovery framework that first validates an agent-generated exploit through differential execution between vulnerable and patched library versions. When the exploit proves ineffective, Refploit analyzes the reproduction progress within the trajectory, locates the segments tied to completed subtasks, and derives constraints to guide focused recovery of the remaining steps.
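
The recovery step described above can be illustrated with a minimal sketch. It assumes a trajectory is a list of steps that have already been labeled with a subtask, an outcome, and a "misleading" flag; in the paper this analysis is done by an LLM, which the sketch does not model. The `Step` type, the subtask names, and the `KEEP`/`AVOID`/`TODO` constraint wording are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    subtask: str              # hypothetical label, e.g. "setup_dependency"
    action: str               # what the agent did in this step
    succeeded: bool           # whether the step appeared to achieve its subtask
    misleading: bool = False  # e.g. the agent re-implemented the vulnerable API


def derive_constraints(trajectory, required_subtasks):
    """Turn a failed trajectory into constraints for a focused retry."""
    # Subtasks that were genuinely completed and can be reused.
    completed = []
    for step in trajectory:
        if step.succeeded and not step.misleading and step.subtask not in completed:
            completed.append(step.subtask)

    # Actions that looked successful but led away from the vulnerable logic.
    avoid = [step.action for step in trajectory if step.misleading]

    # Whatever is still missing for a full reproduction.
    remaining = [t for t in required_subtasks if t not in completed]

    constraints = [f"KEEP: reuse the result of '{t}'" for t in completed]
    constraints += [f"AVOID: {a}" for a in avoid]
    constraints += [f"TODO: complete '{t}'" for t in remaining]
    return constraints


trajectory = [
    Step("setup_dependency", "add vulnerable jar to classpath", True),
    Step("trigger_vulnerability",
         "call a self-implemented copy of the vulnerable API", True, True),
]
for line in derive_constraints(trajectory, ["setup_dependency", "trigger_vulnerability"]):
    print(line)
```

In this example the dependency setup is kept, the self-implemented API call is marked as a path to avoid, and triggering the vulnerability remains as the only open subtask.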

What carries the argument

A differential execution validator compares outcomes on the vulnerable and patched library versions; progress analysis then extracts recovery constraints from the partial trajectory and applies them to the retry.
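
As a rough illustration of the validator, the sketch below treats an exploit as effective only if it triggers on the vulnerable version and does not trigger on the patched one. The two callables stand in for "build and run the exploit against a given library version"; `RunResult` and `is_effective` are hypothetical names, and the paper's actual check may use richer signals than a single boolean.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RunResult:
    triggered: bool   # did the exploit's own success condition fire?
    detail: str = ""  # free-form evidence, e.g. an exception name


def is_effective(run_on_vulnerable: Callable[[], RunResult],
                 run_on_patched: Callable[[], RunResult]) -> bool:
    """Accept only exploits whose behavior differs across the patch."""
    vulnerable = run_on_vulnerable()
    patched = run_on_patched()
    return vulnerable.triggered and not patched.triggered


# A genuine exploit: fires before the patch, not after.
print(is_effective(lambda: RunResult(True), lambda: RunResult(False)))

# An exploit that bundles its own copy of the vulnerable logic fires on
# both versions, so the patch makes no difference and it is rejected.
print(is_effective(lambda: RunResult(True), lambda: RunResult(True)))
```

The second case is the failure mode the abstract highlights: the exploit appears to succeed, but the library patch has no effect on it.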

If this is right

  • More public exploit references can be turned into working tests without full manual reconstruction.
  • The same recovery process yields consistent gains when applied to a different underlying code agent.
  • Automated reproduction becomes feasible at larger scale across open-source Java libraries.
  • Exploit-generation pipelines that previously stopped at first failure can now continue from partial successes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Security teams could integrate trajectory repair into continuous vulnerability scanning to produce proof-of-concept tests faster.
  • The technique may extend to other languages if equivalent pairs of vulnerable and patched library builds are available for differential testing.
  • Failed agent runs become a reusable source of negative examples that can improve future agent prompting without additional human labeling.

Load-bearing premise

Differential execution between vulnerable and patched library versions can reliably tell whether an agent-generated exploit actually triggers the vulnerable logic rather than producing superficially successful but ineffective code.

What would settle it

A collection of cases in which differential execution reports success but independent manual inspection or additional test oracles show the exploit never reaches the vulnerable code path.
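
Such a collection could be summarized as a false-positive rate for the oracle. The sketch below assumes each audited case is a pair of booleans (oracle verdict, manual confirmation that the vulnerable path was reached); the function name and the sample data are made up for illustration and are not results from the paper.

```python
def oracle_false_positive_rate(cases):
    """cases: iterable of (oracle_success, reaches_vulnerable_path) booleans."""
    # Only cases the oracle accepted can be false positives.
    accepted = [reached for ok, reached in cases if ok]
    if not accepted:
        return 0.0
    misses = sum(1 for reached in accepted if not reached)
    return misses / len(accepted)


# Illustrative data only: three accepted exploits, one of which never
# reaches the vulnerable code according to manual inspection.
sample = [(True, True), (True, True), (True, False), (False, False)]
print(oracle_false_positive_rate(sample))
```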

Figures

Figures reproduced from arXiv: 2607.01760 by Jiayuan Zhou, Xiaohu Yang, Xing Hu, Xin Xia, Zhipeng Xue, Zirui Chen.

Figure 1: Refining the initial trajectory of reproducing CVE-2024-23833 by Refploit.
Figure 2: The overall framework and prompt details of Refploit.
Original abstract

Vulnerability exploits play a crucial role in assessing the downstream impact of Java library vulnerabilities. While some vulnerabilities are accompanied by disclosed exploit references, automatically reproducing such references into runnable exploits remains challenging because they are often incomplete, unstructured, or only describe partial reproduction steps. Recent code agents provide a promising way to automate this process, but our study shows that their generated exploits often appear successful without triggering the actual vulnerable logic, such as replacing vulnerable APIs with self-implemented functions. To address this, we propose Refploit, an LLM-based trajectory recovery framework for facilitating vulnerability reproduction from public exploit references. The key insight is that a failed agent trajectory is not entirely useless. It may have already completed some reproduction subtasks while also revealing misleading directions that should be avoided. Refploit first validates an agent-generated exploit through differential execution. When the exploit is ineffective, Refploit analyzes its reproduction progress, locates the trajectory segments associated with the reproduction progress, and derives constraints to guide focused recovery. We evaluate Refploit on three open-source Java vulnerability datasets, covering 172 exploit references for 143 vulnerabilities. Under DeepSeek-V4-Flash, Refploit successfully reproduces 138 exploits, achieving a reproduction rate of 80.2%. It achieves a 64.3% relative improvement over the initially generated trajectories and outperforms both the SOTA exploit-generation method PoCGen and advanced code agents such as Codex with GPT-5.4. We further adapt Refploit to another code agent and observe consistent improvements, demonstrating its generality.
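
The headline numbers in the abstract can be cross-checked with a few lines of arithmetic. The abstract does not state how many exploits the initial trajectories reproduced; the value computed below is inferred from the reported 64.3% relative improvement and should be read as an implied figure, not one reported by the authors.

```python
total_references = 172
reproduced = 138
relative_gain = 0.643  # reported relative improvement over initial trajectories

# Reported reproduction rate: 138 of 172 references.
rate = reproduced / total_references

# Inferred (not reported) number of initially successful reproductions.
implied_initial = reproduced / (1 + relative_gain)
initial_count = round(implied_initial)

# Recompute the gain from the rounded count as a consistency check.
check_gain = reproduced / initial_count - 1

print(f"reproduction rate: {rate:.1%}")
print(f"implied initial successes: {initial_count}")
print(f"implied initial rate: {initial_count / total_references:.1%}")
print(f"recomputed relative gain: {check_gain:.1%}")
```

Under this reading, the reported figures are mutually consistent with roughly 84 of the 172 references being reproduced before repair.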

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Refploit, an LLM-based framework for repairing failed code-agent trajectories to reproduce Java library vulnerability exploits from incomplete public references. It validates generated exploits via differential execution on vulnerable vs. patched library versions; for ineffective cases, it analyzes reproduction progress in the trajectory, locates relevant segments, and derives constraints to guide recovery. Evaluated on 172 exploit references for 143 vulnerabilities from three open-source Java datasets, Refploit achieves an 80.2% reproduction rate with DeepSeek-V4-Flash (64.3% relative improvement over initial trajectories) and outperforms PoCGen as well as agents such as Codex with GPT-5.4; consistent gains are shown when adapted to another code agent.

Significance. If the differential execution oracle is shown to reliably identify trajectories that trigger the vulnerable logic (rather than superficial behavioral differences), the work offers a practical advance in automated exploit construction by salvaging partial progress from failed agent runs. The multi-dataset evaluation, direct comparisons to SOTA methods, and cross-agent generality provide a solid empirical foundation for the central claims.

major comments (2)
  1. [Abstract and method description] Differential execution is the load-bearing oracle for the reported 80.2% reproduction rate and 64.3% relative improvement. The abstract itself notes that naive agents can appear successful by replacing vulnerable APIs with self-implemented functions; the manuscript must specify (with concrete implementation details) how the differential check rules out analogous false positives arising from side effects, unrelated exception paths, or version-specific behavior orthogonal to the CVE.
  2. [Evaluation] The outperformance claims versus PoCGen and Codex with GPT-5.4 rest on the reproduction metric; without explicit controls (identical LLM back-ends, prompt templates, and success criteria across baselines) the 64.3% relative gain cannot be isolated from implementation differences.
minor comments (1)
  1. [Abstract] Clarify whether the 172 references map one-to-one with the 143 vulnerabilities or include multiple references per vulnerability, and report per-dataset breakdowns of the 138 successful reproductions.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting areas where additional clarity will strengthen the manuscript. We address each major comment below and commit to revisions that improve the presentation of the differential execution oracle and the fairness of baseline comparisons.

Point-by-point responses
  1. Referee: [Abstract and method description] Differential execution is the load-bearing oracle for the reported 80.2% reproduction rate and 64.3% relative improvement. The abstract itself notes that naive agents can appear successful by replacing vulnerable APIs with self-implemented functions; the manuscript must specify (with concrete implementation details) how the differential check rules out analogous false positives arising from side effects, unrelated exception paths, or version-specific behavior orthogonal to the CVE.

    Authors: We agree that the current description of the differential execution oracle would benefit from more concrete implementation details to explicitly rule out the listed classes of false positives. In the revised manuscript we will add a new subsection (under Section 3.2) that provides: (1) the exact instrumentation points used to capture CVE-relevant API calls and exception signatures; (2) the trace-comparison logic that requires a behavioral divergence only on the vulnerable path while discarding unrelated side effects and orthogonal version differences; and (3) pseudocode together with a worked example from one of the evaluated CVEs. These additions will make the oracle’s false-positive rejection criteria fully reproducible. revision: yes

  2. Referee: [Evaluation] The outperformance claims versus PoCGen and Codex with GPT-5.4 rest on the reproduction metric; without explicit controls (identical LLM back-ends, prompt templates, and success criteria across baselines) the 64.3% relative gain cannot be isolated from implementation differences.

    Authors: We acknowledge the importance of explicit controls. The success criterion (differential execution on vulnerable vs. patched versions) was applied uniformly to all methods. However, the manuscript does not currently tabulate the precise prompt templates or back-end configurations used for each baseline. We will add an appendix that lists the exact prompts, confirms identical success criteria, and notes any unavoidable differences (e.g., Codex with GPT-5.4 uses its native agent interface). This will allow readers to isolate the contribution of Refploit’s recovery mechanism from implementation variance. revision: yes
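
The trace-comparison logic promised in the first response is not specified in the material here. One plausible shape, offered only as a sketch, is to require both that the vulnerable run actually called a CVE-relevant library API and that the outcome diverged across the patch. The `Trace` type, the `cve_apis` set, and the method names such as `com.example.lib.Parser.parse` are hypothetical placeholders, not identifiers from the paper or from any real library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    calls: tuple   # fully qualified methods observed during the run
    outcome: str   # e.g. "ok" or "exception:StackOverflowError"


def reaches_vulnerable_logic(vulnerable: Trace, patched: Trace, cve_apis: set) -> bool:
    """Accept only if a CVE-relevant API was hit and behavior diverged."""
    hit_cve_api = any(call in cve_apis for call in vulnerable.calls)
    diverged = vulnerable.outcome != patched.outcome
    return hit_cve_api and diverged


cve_apis = {"com.example.lib.Parser.parse"}  # hypothetical CVE-relevant API

# Genuine exploit: calls the library API and diverges across the patch.
real_v = Trace(("com.example.lib.Parser.parse",), "exception:StackOverflowError")
real_p = Trace(("com.example.lib.Parser.parse",), "ok")
print(reaches_vulnerable_logic(real_v, real_p, cve_apis))

# Version-specific difference unrelated to the CVE: outcomes diverge,
# but no CVE-relevant API was called, so the exploit is rejected.
other_v = Trace(("com.example.lib.Util.version",), "exception:NoSuchMethodError")
other_p = Trace(("com.example.lib.Util.version",), "ok")
print(reaches_vulnerable_logic(other_v, other_p, cve_apis))
```

Requiring the API hit in addition to the divergence is what separates this check from a plain pass/fail comparison, which would accept the second case.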

Circularity Check

0 steps flagged

No circularity; evaluation uses external datasets and independent baselines

Full rationale

The paper evaluates Refploit on three open-source Java vulnerability datasets (172 exploit references for 143 vulnerabilities) and reports reproduction rates against those fixed references. Success is measured via differential execution between vulnerable and patched library versions, with explicit comparisons to external methods (PoCGen, Codex+GPT-5.4). No parameters are fitted to the target metric, no self-citations form the load-bearing justification, and no derivation reduces to a renaming or self-definition of the reported 80.2% rate or 64.3% improvement. The central claims are therefore falsifiable against the cited external corpora and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the representativeness of the three Java vulnerability datasets and the assumption that differential execution provides a valid oracle for exploit effectiveness; no free parameters or invented entities are described.

axioms (1)
  • Domain assumption: the three open-source Java vulnerability datasets covering 172 exploit references are representative for evaluating reproduction methods.
    Evaluation results are reported on these specific datasets without further justification of their coverage or selection criteria.

pith-pipeline@v0.9.1-grok · 5821 in / 1238 out tokens · 27797 ms · 2026-07-03T09:06:48.885347+00:00 · methodology

