pith. machine review for the scientific record

arxiv: 2605.05000 · v1 · submitted 2026-05-06 · 💻 cs.CR · cs.LG

Recognition: unknown

Agentic Vulnerability Reasoning on Windows COM Binaries

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:51 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords agentic pipeline · vulnerability discovery · race conditions · Windows COM · proof-of-concept · binary analysis · dynamic debugging

The pith

Tool interfaces for binary exploration and debugging let AI agents autonomously find and verify race condition vulnerabilities in Windows COM binaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a method for using AI agents to discover vulnerabilities in Windows Component Object Model services. The services run with high privileges and are accessible to users, so race conditions in them pose risks for local privilege escalation. The approach gives agents reusable tools to explore binaries, inspect COM details, and use dynamic debugging to get feedback. This setup allows the agents to identify issues and produce working proof-of-concept code that is verified in a debugger. When applied to actual production services, it located multiple new vulnerabilities that were later confirmed by the vendor.

Core claim

The central discovery is that exposing binary exploration, COM inspection, and dynamic debugging as reusable tool interfaces enables agents to autonomously discover race condition vulnerabilities in COM binaries and generate debugger-verified proof-of-concept code, with the pipeline proving effective enough to identify previously unknown issues in deployed Windows services.

What carries the argument

The end-to-end agentic pipeline with reusable tool interfaces for binary exploration, COM inspection, and dynamic debugging that provide the necessary context and feedback for vulnerability reasoning and PoC synthesis.
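The shape of such a pipeline can be sketched in a few lines. This is an illustrative stand-in only: the tool names (`decompile`, `debug_run`), the report format, and the two-stage handoff below are hypothetical, not SLYP's actual MCP interfaces.

```python
from collections import deque

def run_agent(llm, tools, task, max_steps=20):
    """Generic ReAct-style loop: the model chooses a tool, observes the result, repeats."""
    transcript = [("task", task)]
    for _ in range(max_steps):
        action = llm(transcript)                 # e.g. {"tool": "decompile", "args": {...}}
        if action["tool"] == "finish":
            return action["args"]                # structured output, e.g. a vulnerability report
        observation = tools[action["tool"]](**action["args"])
        transcript.append((action["tool"], observation))
    return None                                  # budget exhausted without a result

def pipeline(llm, binary_tools, com_tools, debug_tools, target):
    """Stage 1 (discovery) sees only binary exploration; Stage 2 (PoC generation)
    additionally gets COM inspection and a debugger, seeded with Stage 1's report."""
    report = run_agent(llm, binary_tools, f"find races in {target}")
    if report is None:
        return None
    stage2_tools = {**binary_tools, **com_tools, **debug_tools}
    return run_agent(llm, stage2_tools, f"write a debugger-verified PoC for {report}")

# Scripted stand-in for the LLM, so the control flow is visible end to end.
script = deque([
    {"tool": "decompile", "args": {"fn": "SetPrintTicket"}},
    {"tool": "finish", "args": {"vuln": "double free in SetPrintTicket"}},
    {"tool": "debug_run", "args": {"poc": "poc.cpp"}},
    {"tool": "finish", "args": {"verified": True}},
])
llm = lambda transcript: script.popleft()
binary_tools = {"decompile": lambda fn: f"pseudocode of {fn}"}
debug_tools = {"debug_run": lambda poc: "breakpoint hit: double free"}

print(pipeline(llm, binary_tools, {}, debug_tools, "PrintWorkflowUserSvc"))
# prints {'verified': True}
```

The key design point the paper's claim rests on is the second stage's larger toolset: the same loop that only explores a binary cannot compile, run, or debug a candidate PoC, which is why feedback-bearing tools change what the agent can verify.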

If this is right

  • The tool-equipped agents can progress from vulnerability discovery to creating verified PoCs in COM binaries.
  • This method identifies race conditions that can lead to local privilege escalation in high-privilege services.
  • The interfaces are generalizable and can be applied to other commercial off-the-shelf binaries.
  • Production Windows COM services harbor previously unknown vulnerabilities detectable through agentic reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar tool interfaces could be developed for analyzing vulnerabilities in other operating system components or software ecosystems.
  • This approach might reduce the manual effort required for reverse engineering complex binary interfaces in security research.
  • Extending the feedback mechanisms from debugging could help address additional classes of vulnerabilities beyond race conditions.

Load-bearing premise

That the provided tool interfaces for exploration, inspection, and debugging give agents enough information and feedback to reach verified PoCs without extensive human guidance in most cases.

What would settle it

Running the agent on a COM binary known to contain a race condition and observing whether it can, without assistance, generate a PoC that reproduces the issue when executed and checked under a debugger.

Figures

Figures reproduced from arXiv: 2605.05000 by Hwiwon Lee, Jongseong Kim, Lingming Zhang.

Figure 1
Figure 1: CVE-2024-49095: Race condition in SetPrintTicket of PrintWorkflowUserSvc. Panels: (a) Use After Free; (b) Double Free.
Figure 2
Figure 2: Two thread interleavings of concurrent SetPrintTicket calls on the same object, producing a use-after-free (left) and a double-free (right).
Figure 3
Figure 3: Overview of Slyp. A single ReAct agent operates in two stages connected by a structured vulnerability report. In Stage 1 (Discovery), the agent uses only the IDA MCP server for binary exploration. In Stage 2 (PoC Generation), the agent uses all three MCP servers: IDA for evidence re-checking, COM inspection for activation metadata and skeleton generation, and dynamic debugging for iterative compile-execute…
Figure 4
Figure 4: Overlap of verified cases among the four top config…
Figure 5
Figure 5: Simplified code snippet for CVE-2025-53802.
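The interleaving Figure 2 describes — both threads passing the allocation check before either frees — can be made concrete with a small sketch. This is an illustrative Python analogue of the C++ race, not the paper's PoC: a counter stands in for heap corruption, and a barrier forces deterministically the schedule that a real exploit must win as a race.

```python
import threading

class ToyAllocation:
    """Stand-in for a COM object's heap allocation; a real double free corrupts the heap."""
    def __init__(self):
        self.allocated = True
        self.free_count = 0

    def free(self):
        self.free_count += 1
        self.allocated = False

def double_free_interleaving(obj):
    """Both threads pass the allocation check before either frees (Figure 2, right panel)."""
    both_checked = threading.Barrier(2)

    def set_print_ticket():            # toy analogue of one concurrent SetPrintTicket call
        if obj.allocated:              # check allocation: both threads still observe True
            both_checked.wait()        # neither may free until both have passed the check
            obj.free()                 # the second free() is the bug

    threads = [threading.Thread(target=set_print_ticket) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return obj.free_count

print(double_free_interleaving(ToyAllocation()))  # prints 2: one allocation, two frees
```

The barrier replaces luck: in the real service the window between check and free is open only briefly, which is why the paper's pipeline needs a debugger in the loop to confirm that a candidate PoC actually hits it.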
read the original abstract

Windows Component Object Model (COM) services run with elevated privileges and are widely accessible to authenticated users, making race conditions in these binaries a critical surface for local privilege escalation. We present SLYP, an end-to-end agentic pipeline that discovers race condition vulnerabilities in COM binaries and generates debugger-verified proof-of-concept (PoC) code. SLYP exposes binary exploration, COM inspection, and dynamic debugging as reusable tool interfaces, giving agents the static context, COM activation metadata, and debugger feedback needed to move from vulnerability discovery to verified PoC generation. On a benchmark of 20 COM objects covering 40 vulnerability cases, SLYP achieves 0.973 F1, outperforming production coding agents by up to 0.208 F1 and the state-of-the-art static analyzer by 3.3x in bug discovery. For PoC generation, production coding agents in their default setup (without our COM inspection and dynamic debugging tools) verify essentially no cases on either frontier model, whereas SLYP's interactive toolsets enable it to autonomously synthesize working PoCs for 67.5% of cases on the strongest configuration. Deployed on production Windows services, SLYP discovers 28 previously unknown vulnerabilities across nine COM services, all confirmed by the Microsoft Security Response Center (MSRC) with 16 CVEs assigned and $140,000 in bounties. Furthermore, SLYP is designed with generalizable binary analysis and debugging interfaces, making it readily applicable to other commercial off-the-shelf (COTS) binaries beyond Windows COM services.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SLYP, an end-to-end agentic pipeline that equips frontier LLMs with reusable tool interfaces for binary exploration, COM inspection, and dynamic debugging to discover race-condition vulnerabilities in Windows COM services and synthesize debugger-verified PoCs. On a benchmark of 20 COM objects spanning 40 cases, it reports 0.973 F1 (outperforming coding agents by up to 0.208 and static analysis by 3.3x) and 67.5% verified PoC success with full tools enabled (versus near-zero for default agents). In deployment on nine production services, SLYP identifies 28 previously unknown vulnerabilities, all externally validated by MSRC (16 CVEs assigned, $140k bounties). The design emphasizes generalizability to other COTS binaries.

Significance. If the empirical results and external validations hold, the work provides concrete evidence that domain-specific tool interfaces can enable agentic systems to perform non-trivial binary vulnerability discovery and exploit generation at scale, a task where unaugmented coding agents fail. The MSRC-confirmed findings and bounty payouts constitute strong real-world impact, while the reusable interfaces offer a template for extending automated analysis beyond Windows COM.

major comments (1)
  1. [Deployment / real-world evaluation section] The central claim that SLYP 'autonomously' discovers and verifies vulnerabilities in production services rests on the tool interfaces supplying sufficient static context, activation metadata, and runtime feedback. Given the benchmark result of only 67.5% verified PoC success even under the strongest configuration (and 0% for default agents), the manuscript should explicitly state the human-intervention protocol used for the 28 reported cases—e.g., whether any post-generation filtering, manual steering, or selective reporting occurred—so that readers can assess whether the end-to-end autonomy metric generalizes from benchmark to live services.
minor comments (2)
  1. [Benchmark construction paragraph] The criteria used to select the 20 COM objects and 40 vulnerability cases (and any exclusion rules) are not fully detailed; adding this information would allow assessment of possible selection bias and improve reproducibility.
  2. [Table or figure reporting PoC success rates] Clarify whether the 67.5% figure reflects fully autonomous runs or includes any human-assisted retries, and ensure the comparison to baselines uses identical model versions and temperature settings.
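For readers sanity-checking the headline numbers: F1 is the harmonic mean of precision and recall. The counts below are hypothetical, chosen only to show how a score near the reported 0.973 can arise on a benchmark of this size; they are not the paper's confusion matrix.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (true negatives play no role)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 37 cases flagged correctly, 1 false alarm, 1 miss.
print(round(f1_score(37, 1, 1), 3))  # prints 0.974
```

Because F1 ignores true negatives, it rewards a detector that both finds most real bugs and raises few spurious reports — which is why the referee's request to see the underlying selection and counting rules matters for interpreting the 0.973.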

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address the major comment below.

read point-by-point responses
  1. Referee: The central claim that SLYP 'autonomously' discovers and verifies vulnerabilities in production services rests on the tool interfaces supplying sufficient static context, activation metadata, and runtime feedback. Given the benchmark result of only 67.5% verified PoC success even under the strongest configuration (and 0% for default agents), the manuscript should explicitly state the human-intervention protocol used for the 28 reported cases—e.g., whether any post-generation filtering, manual steering, or selective reporting occurred—so that readers can assess whether the end-to-end autonomy metric generalizes from benchmark to live services.

    Authors: We agree that explicit clarification of the deployment protocol is warranted to support the autonomy claim and allow readers to evaluate generalizability. In the revised manuscript we will expand the Deployment on Production Services section with a new paragraph stating the following: SLYP was executed end-to-end in fully automated mode using the complete set of binary-exploration, COM-inspection, and dynamic-debugging tool interfaces on each of the nine production services. No manual steering, post-generation filtering, or selective reporting occurred; the 28 vulnerabilities comprise every case in which the agent autonomously identified a race condition and synthesized a debugger-verified PoC that was subsequently confirmed by MSRC. This protocol is identical to the benchmark evaluation, where only verified PoCs are counted toward the 67.5% success rate. The lower benchmark success rate simply reflects the inherent difficulty of the task; deployment results report all confirmed autonomous successes without additional human curation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from tool execution and external confirmation

full rationale

The paper presents an agentic pipeline (SLYP) with reusable tool interfaces for binary exploration, COM inspection, and dynamic debugging. All reported metrics (0.973 F1 on benchmark, 67.5% PoC success, 28 MSRC-confirmed vulnerabilities with CVEs and bounties) are direct empirical outcomes from running the system on fixed test cases and live services. No equations, fitted parameters, or first-principles derivations are claimed; results do not reduce to self-definition, renamed inputs, or self-citation chains. External MSRC confirmation and benchmark comparisons provide independent validation outside the pipeline's internal state.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems and security-engineering paper. No mathematical derivations, fitted constants, or theoretical axioms are invoked; the central claims rest on observed agent performance and external MSRC validation rather than on any postulated entities or free parameters.

pith-pipeline@v0.9.0 · 5578 in / 1344 out tokens · 105418 ms · 2026-05-08T16:51:20.430273+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

61 extracted references · 14 canonical work pages · 2 internal anchors

  1. [1]

    2023. Proceedings of the 44th IEEE Symposium on Security and Privacy (Oakland). San Francisco, CA.

  2. [2]

    2025. Proceedings of the 32nd Annual Network and Distributed System Security Symposium (NDSS). San Diego, CA.

  3. [3]

    Talor Abramovich, Meet Udeshi, Minghao Shao, Kilian Lieret, Haoran Xi, Kimberly Milner, Sofija Jancheska, John Yang, Carlos E Jimenez, Farshad Khorrami, et al. 2025. EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=Of3wZhVv1R

  4. [4]

    Angr Project. [n. d.]. angr: A Binary Analysis Framework. https://angr.io/

  5. [5]

    Anthropic. 2024. Model Context Protocol (MCP). https://www.anthropic.com/news/model-context-protocol

  6. [6]

    Anthropic. 2025. Claude Code. https://claude.com/product/claude-code

  7. [7]

    Abhishek Arya, Oliver Chang, Jonathan Metzman, Kostya Serebryany, and Dongge Liu. 2016. OSS-Fuzz. https://github.com/google/oss-fuzz

  8. [8]

    Big Sleep team. 2024. From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities in Real-World Code. https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html. Blog post.

  9. [9]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models Are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.

  10. [10]

    Xiang Chen, Anshunkang Zhou, Chengfeng Ye, and Charles Zhang. 2025. ClearAgent: Agentic Binary Analysis for Effective Vulnerability Detection. In ACM SIGPLAN International Workshop on Language Models and Programming Languages (LMPL). https://doi.org/10.1145/3759425.3763397

  11. [11]

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 423–435.

  12. [12]

    Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A. Huerta, and Hao Peng. 2025. Context Length Alone Hurts LLM Performance Despite Perfect Retrieval. arXiv preprint arXiv:2510.05381 (2025). https://arxiv.org/abs/2510.05381. Accepted at the Findings of EMNLP 2025.

  13. [13]

    Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. 2024. LLM Agents can Autonomously Exploit One-day Vulnerabilities. arXiv preprint arXiv:2404.08144 (2024).

  14. [14]

    James Forshaw. 2016. OleView.NET. https://github.com/tyranid/oleviewdotnet

  15. [15]

    Google Project Zero. 2024. Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models. https://googleprojectzero.blogspot.com/2024/06/project-naptime.html. Blog post.

  16. [16]

    Fangming Gu, Qingli Guo, Lian Li, Zhiniang Peng, Wei Lin, Xiaobo Yang, and Xiaorui Gong. 2022. COMRace: detecting data race vulnerabilities in COM objects. In Proceedings of the 31st USENIX Security Symposium (Security). Boston, MA.

  17. [17]

    HyungSeok Han, JeongOh Kyea, Yonghwi Jin, Jinoh Kang, Brian Pak, and Insu Yun. 2023. QueryX: Symbolic Query on Decompiled Code for Finding Bugs in COTS Binaries. See [1].

  18. [18]

    Hex-Rays. 2025. IDA Pro. https://hex-rays.com/ida-pro

  19. [19]

    Kelly Hong, Anton Troynikov, and Jeff Huber. 2025. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Technical Report. Chroma. https://research.trychroma.com/context-rot

  20. [20]

    Mengkang Hu, Yao Mu, Xinmiao Chelsey Yu, Mingyu Ding, Shiguang Wu, Wenqi Shao, Qiguang Chen, Bin Wang, Yu Qiao, and Ping Luo. 2024. Tree-Planner: Efficient Close-loop Task Planning with Large Language Models. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=Glcsog6zOe

  21. [21]

    Peiwei Hu, Ruigang Liang, and Kai Chen. 2024. DeGPT: Optimizing Decompiler Output with LLM. In Proceedings of the 31st Annual Network and Distributed System Security Symposium (NDSS). San Diego, CA. https://www.ndss-symposium.org/wp-content/uploads/2024-401-paper.pdf

  22. [22]

    Dae R Jeong, Kyungtae Kim, Basavesh Shivakumar, Byoungyoung Lee, and Insik Shin. 2019. Razzer: Finding Kernel Race Bugs through Fuzzing. In Proceedings of the 40th IEEE Symposium on Security and Privacy (Oakland). San Francisco, CA.

  23. [23]

    Linxi Jiang, Xin Jin, and Zhiqiang Lin. 2025. Beyond Classification: Inferring Function Names in Stripped Binaries via Domain Adapted LLMs. See [2]. https://www.ndss-symposium.org/wp-content/uploads/2025-797-paper.pdf

  24. [24]

    Taesoo Kim, HyungSeok Han, Soyeon Park, Dae R Jeong, Dohyeok Kim, Dongkwan Kim, Eunsoo Kim, Jiho Kim, Joshua Wang, Kangsu Kim, et al. 2025. ATLANTIS: AI-driven Threat Localization, Analysis, and Triage Intelligence System. arXiv preprint arXiv:2509.14589 (2025).

  25. [25]

    Ahmed Lekssays, Hamza Mouhcine, Khang Tran, Ting Yu, and Issa Khalil. 2025. LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models. In Proceedings of the 34th USENIX Security Symposium (Security). Seattle, WA.

  26. [26]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173.

  27. [27]

    Puzhuo Liu, Chengnian Sun, Yaowen Zheng, Xuan Feng, Chuan Qin, Yuncheng Wang, Zhenyang Xu, Zhi Li, Peng Di, Yu Jiang, et al. 2025. LLM-Powered Static Binary Taint Analysis. ACM Transactions on Software Engineering and Methodology 34, 3 (2025), 1–36.

  28. [28]

    Zhengxiong Luo, Huan Zhao, Dylan Wolff, Cristian Cadar, and Abhik Roychoudhury. 2026. Agentic Concolic Execution. In Proceedings of the 47th IEEE Symposium on Security and Privacy (Oakland). San Francisco, CA. https://doi.ieeecomputersociety.org/10.1109/SP63933.2026.00003

  29. [29]

    Yougang Lyu, Lingyong Yan, Shuaiqiang Wang, Haibo Shi, Dawei Yin, Pengjie Ren, Zhumin Chen, Maarten de Rijke, and Zhaochun Ren. 2024. KnowTuning: Knowledge-aware Fine-Tuning for Large Language Models. arXiv preprint arXiv:2402.11176 (2024).

  30. [30]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems 35 (2022), 17359–17372.

  31. [31]

    Microsoft. Accessed: 2026-04-08. Application Verifier. https://learn.microsoft.com/en-us/windows-hardware/drivers/devtest/application-verifier

  32. [32]

    Microsoft. Accessed: 2026-04-08. Component Object Model (COM). https://learn.microsoft.com/en-us/windows/win32/com/component-object-model--com--portal

  33. [33]

    Vikram Nitin, Baishakhi Ray, and Roshanak Zilouchian Moghaddam. 2025. FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents. arXiv preprint arXiv:2507.15241 (2025).

  34. [34]

    OpenAI. 2025. Codex. https://openai.com/codex/

  35. [35]

    OpenAI. 2025. Codex CLI. https://developers.openai.com/codex/cli

  36. [36]

    OpenHands. 2025. OpenHands Context Condensensation for More Efficient AI Agents. https://openhands.dev/blog/openhands-context-condensensation-for-more-efficient-ai-agents

  37. [37]

    Chengbin Pang, Ruotong Yu, Yaohui Chen, Eric Koskinen, Georgios Portokalidis, Bing Mao, and Jun Xu. 2021. SoK: All You Ever Wanted to Know About x86/x64 Binary Disassembly But Were Afraid to Ask. In Proceedings of the 42nd IEEE Symposium on Security and Privacy (Oakland). Virtual.

  38. [38]

    Andre Pawlowski, Moritz Contag, Victor van der Veen, Chris Ouwehand, Thorsten Holz, Herbert Bos, Elias Athanasopoulos, and Cristiano Giuffrida. 2017. MARX: Uncovering Class Hierarchies in C++ Programs. In Proceedings of the 24th Annual Network and Distributed System Security Symposium (NDSS). San Diego, CA. https://www.ndss-symposium.org/wp-content/uploads...

  39. [39]

    Wanzong Peng, Lin Ye, Xuetao Du, Hongli Zhang, Dongyang Zhan, Yunting Zhang, Yicheng Guo, and Chen Zhang. 2025. PwnGPT: Automatic Exploit Generation Based on Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 11481–11494.

  40. [40]

    Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Vol. 4. Now Publishers Inc.

  41. [41]

    Joschua Schilling, Andreas Wendler, Philipp Görz, Nils Bars, Moritz Schloegel, and Thorsten Holz. 2024. A Binary-level Thread Sanitizer or Why Sanitizing on the Binary Level is Hard. In Proceedings of the 33rd USENIX Security Symposium (Security). Philadelphia, PA. https://www.usenix.org/system/files/sec24fall-prepub-921-schilling.pdf

  42. [42]

    Edward J. Schwartz, Cory F. Cohen, Michael Duggan, Jeffrey Gennari, Jeffrey S. Havrilla, and Charles Hines. 2018. Using Logic Programming to Recover C++ Classes and Methods from Compiled Executables. In Proceedings of the 25th ACM Conference on Computer and Communications Security (CCS). Toronto, ON, Canada. https://doi.org/10.1145/3243734.3243793

  43. [43]

    Xiuwei Shang, Shaoyin Cheng, Guoqiang Chen, Yanming Zhang, Li Hu, Xiao Yu, Gangyang Li, Weiming Zhang, and Nenghai Yu. 2024. How Far Have We Gone in Binary Code Understanding Using Large Language Models. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 1–12.

  44. [44]

    Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, et al. 2016. SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis. In Proceedings of the 37th IEEE Symposium on Security and Privacy (Oakland). San Jose, CA. https://sites.cs.ucs...

  45. [45]

    Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. 2024. LLM4Decompile: Decompiling Binary Code with Large Language Models. arXiv preprint arXiv:2403.05286 (2024).

  46. [46]

    Aishwarya Upadhyay, Vijay Laxmi, and Smita Naval. 2023. Navigating the Concurrency Landscape: A Survey of Race Condition Vulnerability Detectors. arXiv preprint arXiv:2312.14479 (2023).

  47. [47]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform fo...

  48. [48]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.

  49. [49]

    Wai Kin Wong, Daoyuan Wu, Huaijin Wang, Zongjie Li, Zhibo Liu, Shuai Wang, Qiyi Tang, Sen Nie, and Shi Wu. 2025. DecLLM: LLM-Augmented Recompilable Decompilation for Enabling Programmatic Use of Decompiled Code. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 1841–1864.

  50. [50]

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal Fuzzing with Large Language Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

  51. [51]

    Chunqiu Steven Xia and Lingming Zhang. 2024. Automated Program Repair via Conversation: Fixing 162 out of 337 Bugs for $0.42 Each using ChatGPT. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 819–831.

  52. [52]

    Yuan-An Xiao, Pengfei Gao, Chao Peng, and Yingfei Xiong. 2025. Improving the Efficiency of LLM Agent Systems through Trajectory Reduction. arXiv preprint arXiv:2509.23586 (2025).

  53. [53]

    Danning Xie, Zhuo Zhang, Nan Jiang, Xiangzhe Xu, Lin Tan, and Xiangyu Zhang

  54. [54]

    ReSym: Harnessing LLMs to Recover Variable and Data Structure Symbols from Stripped Binaries. In Proceedings of the 31st ACM Conference on Computer and Communications Security (CCS). Salt Lake City, UT. https://www.cs.purdue.edu/homes/lintan/publications/resym-ccs24.pdf

  55. [55]

    Xiangzhe Xu, Zhuo Zhang, Zian Su, Ziyang Huang, Shiwei Feng, Yapeng Ye, Nan Jiang, Danning Xie, Siyuan Cheng, Lin Tan, et al. 2025. Unleashing the Power of Generative Model in Recovering Variable Names from Stripped Binary. See [2].

  56. [56]

    Chenyuan Yang, Zijie Zhao, Zichen Xie, Haoyu Li, and Lingming Zhang. 2025. KNighter: Transforming Static Analysis with LLM-Synthesized Checkers. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP '25), Lotte Hotel World, Seoul, Republic of Korea. Association for Computing Machinery, New York, NY, USA, 655–669. doi:10.114...

  57. [57]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793 (2024).

  58. [58]

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In The Eleventh International Conference on Learning Representations.

  59. [59]

    Yuxing Zhang, Xiaogang Zhu, Daojing He, Minhui Xue, Shouling Ji, Mohammad Sayad Haghighi, Sheng Wen, and Zhiniang Peng. 2023. Detecting Union Type Confusion in Component Object Model. In Proceedings of the 32nd USENIX Security Symposium (Security). Anaheim, CA. https://www.usenix.org/conference/usenixsecurity23/presentation/zhang-yuxing

  60. [60]

    Zijie Zhao, Chenyuan Yang, Weidong Wang, Yihan Yang, Ziqi Zhang, and Lingming Zhang. 2026. AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection. arXiv preprint arXiv:2604.11950 (2026).

  61. [61]

    Wenyu Zhu, Zhiyao Feng, Zihan Zhang, Jianjun Chen, Zhijian Ou, Min Yang, and Chao Zhang. 2023. CALLEE: Recovering Call Graphs for Binaries with Transfer and Contrastive Learning. See [1]. https://www.jianjunchen.com/p/callee.sp23.pdf