pith. sign in

arxiv: 2605.15503 · v1 · pith:R4YIVGVCnew · submitted 2026-05-15 · 💻 cs.CR

uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs

Pith reviewed 2026-05-19 15:42 UTC · model grok-4.3

classification 💻 cs.CR
keywords microarchitectural attacksLLM code generationSpectre attackPrime+Probeproof of conceptmulti-agent frameworkvulnerability assessmentcache attacks
0
0 comments X

The pith

uGen generates functionally correct microarchitectural attack PoCs by using multi-agent retrieval to fill LLM knowledge gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces uGen as a framework to automate the creation of proof-of-concept code for microarchitectural attacks such as Spectre and Prime+Probe. It begins by systematically studying how current LLMs like GPT, Claude, and Qwen3 fail to generate correct attack primitives. Then it deploys a retrieval-augmented multi-agent system to inject the missing knowledge and produce code tailored to specific defender needs. This matters because manual PoC development is labor-intensive and lacks portability, limiting broad vulnerability assessment in processors. Evaluation shows the approach works across different models and hardware with high success rates and low cost.

Core claim

uGen is the first LLM-driven framework for automated microarchitectural attack code generation. A systematic study reveals that LLMs frequently misgenerate or misplace critical attack primitives. Guided by this, uGen uses a retrieval-augmented, multi-agent design to inject missing domain knowledge and synthesize functionally correct PoCs for cache-based and speculative-execution attacks tailored to defender requirements across diverse microarchitectures and vulnerable functions.

What carries the argument

Retrieval-augmented multi-agent design for injecting missing attack primitives into LLMs to generate correct PoC code.

If this is right

  • Up to 100% success rate for generating Spectre-v1 PoCs using Claude Sonnet-4.
  • 80% success rate for Prime+Probe PoCs using Qwen3-Coder.
  • Successful PoC generation at a cost of $1.25 in under four minutes.
  • Applicable across a diverse set of microarchitectures, vulnerable functions, and execution environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenders could use similar frameworks to quickly test new processor models for emerging attack vectors.
  • Extending the systematic gap analysis to additional attack types like rowhammer could broaden automated security testing.
  • The low cost suggests potential for integration into continuous integration pipelines for hardware security validation.

Load-bearing premise

The method of studying LLM gaps and using retrieval-augmented agents will consistently yield working and portable attack code for different processors and settings.

What would settle it

Generating PoC code with uGen and then running it on a specific vulnerable CPU to check if it successfully demonstrates the attack as expected.

Figures

Figures reproduced from arXiv: 2605.15503 by Berk Gulmezoglu, Debopriya Roy Dipta, Eduard Marin, Thomas Eisenbarth, Thore Tiemann.

Figure 1
Figure 1. Figure 1: Multi-agent architecture of uGen. The architecture consists of (S1) Knowledge Gap Profiler, (S2) RAG-Document Generator, (S3) RAG Validation & Refinement, and (S4) Deployment stage. fails, a RAG document is generated based on the ground truth implementation, failed attempts, and validation criteria. Each RAG document is constructed with the same level of details: 1) the significance of the attack metric fo… view at source ↗
Figure 2
Figure 2. Figure 2: Multi-stage workflow of uGen: S1 (Knowledge Gap Profiler) identifies overlooked, misgenerated, or misplaced attack attributes in LLM-generated PoC code; S2 (RAG Document Generator) generates domain-specific details for attack attributes; S3 (RAG Validation & Refinement) validates and refines generated RAG documents; S4 (Deployment) refers to the final deployed stage for the end-user. window is clearly emph… view at source ↗
Figure 3
Figure 3. Figure 3: Knowledge Gap Profiler results showing the success rate of correctly implementing each metric in PoC codes across [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
read the original abstract

Microarchitectural attacks continue to evolve, uncovering new exploitation vectors in modern processors. From a defensive perspective, assessing a system's susceptibility to such attacks remains challenging. Developing functional attack implementations is labor-intensive, requires deep microarchitectural expertise, and is highly sensitive to execution environments. Consequently, existing attacks often lack portability, limiting systematic and scalable vulnerability assessment. Recent advances in large language models (LLMs) suggest a potential avenue for lowering these barriers. However, it remains unclear whether LLMs can reliably generate functionally correct microarchitectural attack code suitable for rigorous vulnerability testing. In this work, we present uGen, the first LLM-driven framework for automated microarchitectural attack code generation. A key challenge we address is identifying attack-specific knowledge gaps in LLMs. Through a systematic study of state-of-the-art models (GPT, Claude, and Qwen3), we find that LLMs frequently misgenerate or misplace critical attack primitives. Guided by this analysis, uGen employs a retrieval-augmented, multi-agent design that injects missing domain knowledge to synthesize functionally correct microarchitectural attack PoCs tailored to defender requirements. We evaluate uGen on cache-based and speculative-execution attacks across diverse set of microarchitectures, vulnerable functions, and LLM platforms. In the deployment stage, uGen achieves up to 100% success rate for Spectre-v1 (Claude Sonnet-4) and 80% for Prime+Probe (Qwen3-Coder). Finally, we demonstrate that uGen can generate a successful PoC code with a cost of $1.25 in under four minutes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces uGen, a retrieval-augmented multi-agent LLM framework designed to generate functionally correct microarchitectural attack PoCs (e.g., Spectre-v1 and Prime+Probe) by first systematically identifying knowledge gaps in models like GPT, Claude, and Qwen3, then injecting missing attack primitives. It evaluates the system across multiple LLMs, microarchitectures, and vulnerable functions, reporting success rates up to 100% for Spectre-v1 (Claude Sonnet-4) and 80% for Prime+Probe (Qwen3-Coder) in a deployment stage, along with a demonstration of low-cost generation ($1.25 in under four minutes).

Significance. If the reported success rates reflect verified microarchitectural side effects (such as measurable speculative leakage or cache timing differentials) rather than syntactic or runtime validity alone, uGen could meaningfully reduce the expertise barrier for defenders performing portable vulnerability assessments. The systematic gap analysis and multi-agent design provide a concrete, reproducible method that could be extended to other side-channel attacks; the cross-model and cross-architecture evaluation adds practical value if the functional-correctness claims are substantiated.

major comments (2)
  1. [Abstract] Abstract and Evaluation section: The success rates (100% for Spectre-v1, 80% for Prime+Probe) are presented without an explicit definition of the success metric or verification procedure. It is unclear whether a PoC is deemed successful upon compilation, execution without crash, or only after confirming actual attack effects (e.g., branch-predictor mistraining leakage exceeding noise for Spectre-v1 or statistically significant cache-eviction timing for Prime+Probe). This directly affects whether the central claim of producing 'functionally correct' PoCs for vulnerability assessment holds.
  2. [Evaluation] Evaluation section: No details are supplied on controls for environmental sensitivity, hardware-specific timing noise, statistical significance testing, or how portability across microarchitectures was validated. Without these, the high success rates cannot be distinguished from basic executability, weakening the evidence that the retrieval-augmented multi-agent approach reliably produces usable attack code.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief table summarizing the exact attack primitives injected by the retrieval component for each evaluated attack.
  2. [Section 3] Notation for agent roles and retrieval sources could be made more consistent across figures and text to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our paper. The comments raise valid points regarding the clarity of our success metrics and evaluation details. We address each major comment below and will revise the manuscript accordingly to improve transparency.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Evaluation section: The success rates (100% for Spectre-v1, 80% for Prime+Probe) are presented without an explicit definition of the success metric or verification procedure. It is unclear whether a PoC is deemed successful upon compilation, execution without crash, or only after confirming actual attack effects (e.g., branch-predictor mistraining leakage exceeding noise for Spectre-v1 or statistically significant cache-eviction timing for Prime+Probe). This directly affects whether the central claim of producing 'functionally correct' PoCs for vulnerability assessment holds.

    Authors: We acknowledge that the definition of success was not sufficiently explicit in the abstract and evaluation sections. In our experiments, a PoC is considered successful only if it produces the expected microarchitectural side effect, verified by measuring actual leakage or timing differentials that exceed noise thresholds, rather than just successful compilation or execution. We will revise the manuscript to include an explicit definition of the success metric and a detailed description of the verification procedure in the Evaluation section. revision: yes

  2. Referee: [Evaluation] Evaluation section: No details are supplied on controls for environmental sensitivity, hardware-specific timing noise, statistical significance testing, or how portability across microarchitectures was validated. Without these, the high success rates cannot be distinguished from basic executability, weakening the evidence that the retrieval-augmented multi-agent approach reliably produces usable attack code.

    Authors: We agree that more details on experimental controls are necessary to substantiate the claims. In the revised version, we will add information on how we controlled for environmental factors, such as using isolated execution environments, performing multiple runs to account for timing noise, applying statistical significance tests (e.g., t-tests on timing measurements), and validating portability by testing on multiple microarchitectures with documented hardware specifications and any necessary code adaptations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation

full rationale

The paper describes an empirical LLM-based framework for generating microarchitectural attack PoCs. Success rates (e.g., 100% for Spectre-v1) are measured by executing and testing generated code on target hardware for functional correctness and side-channel effects, rather than being defined internally or fitted to inputs. The systematic study of LLM gaps and the retrieval-augmented multi-agent design are presented as engineering choices validated externally through experiments across models and microarchitectures, with no equations, self-definitional loops, or load-bearing self-citations that reduce the central results to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full details unavailable. Core assumptions concern LLM limitations in attack primitives and the corrective power of retrieval augmentation.

axioms (2)
  • domain assumption LLMs frequently misgenerate or misplace critical attack primitives that can be systematically identified.
    Presented as the key challenge guiding the framework design.
  • domain assumption Retrieval-augmented multi-agent design can supply the missing domain knowledge to produce correct PoCs.
    Central mechanism of uGen.

pith-pipeline@v0.9.0 · 5839 in / 1345 out tokens · 64422 ms · 2026-05-19T15:42:18.678473+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 5 internal anchors

  1. [1]

    Introducing Claude 4

    Anthropic. Introducing Claude 4. https://www.anthropic.com/news/ claude-4, 2025. Accessed 2026-01-29

  2. [2]

    Anthropic’s transparency hub

    Anthropic. Anthropic’s transparency hub. https://www.anthropic.com/ transparency, 2026. Accessed 2026-01-29

  3. [3]

    Branch history injection: On the effectiveness of hardware mitigations against cross-privilege Spectre-v2 attacks

    Enrico Barberis, Pietro Frigo, Marius Muench, Herbert Bos, and Cris- tiano Giuffrida. Branch history injection: On the effectiveness of hardware mitigations against cross-privilege Spectre-v2 attacks. In Kevin R. B. Butler and Kurt Thomas, editors,31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022, pages 971–988. USE...

  4. [4]

    SMoTherSpectre: Exploiting speculative execution through port contention

    Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kur- mus. SMoTherSpectre: Exploiting speculative execution through port contention. In Lorenzo Cavallaro, Johannes Kinder, XiaoFeng Wang, and Jonathan Katz, editors,Proceedings of the 2019 ACM SIGSAC Conference on Computer and C...

  5. [5]

    Rae, Erich Elsen, and Laurent Sifre

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean- Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Or...

  6. [6]

    Aim, wait, shoot: How the CacheSniper technique improves unprivileged cache attacks

    Samira Briongos, Ida Bruhns, Pedro Malag ´on, Thomas Eisenbarth, and Jos´e Manuel Moya. Aim, wait, shoot: How the CacheSniper technique improves unprivileged cache attacks. InIEEE European Symposium on Security and Privacy, EuroS&P 2021, Vienna, Austria, September 6-10, 2021, pages 683–700. IEEE, 2021

  7. [7]

    Unprecedented code change automation: The fusion of LLMs and transformation by example.Proc

    Malinda Dilhara, Abhiram Bellur, Timofey Bryksin, and Danny Dig. Unprecedented code change automation: The fusion of LLMs and transformation by example.Proc. ACM Softw. Eng., 1(FSE):631–653, 2024

  8. [8]

    De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding.arXiv preprint arXiv:2401.01701,

    Aryaz Eghbali and Michael Pradel. De-Hallucinator: Iterative grounding for LLM-based code completion.CoRR, abs/2401.01701, 2024. 14

  9. [9]

    LLM Agents can Autonomously Exploit One-day Vulnerabilities

    Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. LLM agents can autonomously exploit one-day vulnerabilities.CoRR, abs/2404.08144, 2024

  10. [10]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. CoRR, abs/2312.10997, 2023

  11. [11]

    Au- topentester: An LLM agent-based framework for automated pentesting

    Yasod Ginige, Akila Niroshan, Sajal Jain, and Suranga Seneviratne. Au- topentester: An LLM agent-based framework for automated pentesting. In24th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2025, Guiyang, China, November 14-17, 2025, pages 163–174. IEEE, 2025

  12. [12]

    Flush+Flush: A fast and stealthy cache attack

    Daniel Gruss, Cl ´ementine Maurice, Klaus Wagner, and Stefan Mangard. Flush+Flush: A fast and stealthy cache attack. In Juan Caballero, Urko Zurutuza, and Ricardo J. Rodr´ıguez, editors,Detection of Intrusions and Malware, and Vulnerability Assessment - 13th International Conference, DIMVA 2016, San Sebasti´an, Spain, July 7-8, 2016, Proceedings, volume 9...

  13. [13]

    Cache template attacks: Automating attacks on inclusive last-level caches

    Daniel Gruss, Raphael Spreitzer, and Stefan Mangard. Cache template attacks: Automating attacks on inclusive last-level caches. In Jaeyeon Jung and Thorsten Holz, editors,24th USENIX Security Symposium, USENIX Security 15, Washington, D.C., USA, August 12-14, 2015, pages 897–912. USENIX Association, 2015

  14. [14]

    Cross-VM cache attacks on AES.IEEE Trans

    Berk G ¨ulmezoglu, Mehmet Sinan Inci, Gorka Irazoqui, Thomas Eisen- barth, and Berk Sunar. Cross-VM cache attacks on AES.IEEE Trans. Multi Scale Comput. Syst., 2(3):211–222, 2016

  15. [15]

    Hennessy and David A

    John L. Hennessy and David A. Patterson.Computer Architecture - A Quantitative Approach, 5th Edition. Morgan Kaufmann, 2012

  16. [16]

    Rain: Transiently leaking data from public clouds using old vulnerabilities

    Math ´e Hertogh, Dave Quakkelaar, Thijs Raymakers, Mahesh Hari Sarma, Marius Muench, Herbert Bos, and Erik van der Kouwe. Rain: Transiently leaking data from public clouds using old vulnerabilities. In IEEE Symposium on Security and Privacy, SP 2026, San Francisco, CA, USA, May 18-21, 206. IEEE, 2026. To be published

  17. [17]

    Metagpt: Meta programming for A multi- agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and J ¨urgen Schmidhuber. Metagpt: Meta programming for A multi- agent collaborative framework. InThe Twelfth International Conference on Learning Representations, ...

  18. [18]

    OpenReview.net, 2024

  19. [19]

    speculative execution, variant 4: speculative store bypass

    Jann Horn. speculative execution, variant 4: speculative store bypass. https://project-zero.issues.chromium.org/issues/42450580, 2018

  20. [20]

    InferFix: End-to-end program repair with LLMs

    Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. InferFix: End-to-end program repair with LLMs. In Satish Chandra, Kelly Blincoe, and Paolo Tonella, editors,Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 20...

  21. [21]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Confer- ence on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 1...

  22. [22]

    Spectre mitigations in Microsoft’s C/C++ compiler

    Paul Kocher. Spectre mitigations in Microsoft’s C/C++ compiler. https: //www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html,

  23. [23]

    Spectre attacks: Exploit- ing speculative execution

    Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre attacks: Exploit- ing speculative execution. In2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019, pages 1–19. IEEE, 2019

  24. [24]

    Khasawneh, Chengyu Song, and Nael B

    Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael B. Abu-Ghazaleh. Spectre returns! speculation attacks using the return stack buffer. In Christian Rossow and Yves Younan, editors,12th USENIX Workshop on Offensive Technologies, WOOT 2018, Baltimore, MD, USA, August 13-14, 2018. USENIX Association, 2018

  25. [25]

    Retrieval- augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K ¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, Sebastian Riedel, and Douwe Kiela. Retrieval- augmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin...

  26. [26]

    Fangfei Liu, Yuval Yarom, Qian Ge, Gernot Heiser, and Ruby B. Lee. Last-level cache side-channel attacks are practical. In2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015, pages 605–622. IEEE Computer Society, 2015

  27. [27]

    ret2spec: Speculative exe- cution using return stack buffers

    Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative exe- cution using return stack buffers. In David Lie, Mohammad Mannan, Michael Backes, and XiaoFeng Wang, editors,Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS 2018, Toronto, ON, Canada, October 15-19, 2018, pages 2109–

  28. [28]

    Debugging with open-source large language models: An evaluation

    Yacine Majdoub and Eya Ben Charrada. Debugging with open-source large language models: An evaluation. In Xavier Franch, Maya Daneva, Silverio Mart´ınez-Fern´andez, and Luigi Quaranta, editors,Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2024, Barcelona, Spain, October 24-25, 2024, pages 5...

  29. [29]

    AutoPen: Towards autonomous penetration testing using LLM-powered agents

    Jiahao Mei, Shuangwu Chen, Yuanyi Ma, and Huizi Song. AutoPen: Towards autonomous penetration testing using LLM-powered agents. In Proceedings of the 9th International Conference on Computer Science and Application Engineering, CSAE 2025, Shanghai, China October 19- 21, 2025, pages 1–6. ACM, 2025

  30. [30]

    Hacksynth: Llm agent and evaluation framework for autonomous penetration testing.arXiv preprint arXiv:2412.01778, 2024

    Lajos Muzsai, David Imolai, and Andr ´as Luk ´acs. HackSynth: LLM agent and evaluation framework for autonomous penetration testing. CoRR, abs/2412.01778, 2024

  31. [31]

    GPT-4o system card

    OpenAI. GPT-4o system card. https://cdn.openai.com/ gpt-4o-system-card.pdf, 2024. Accessed 2026-01-29

  32. [32]

    Introducing SWE-bench Verified

    OpenAI. Introducing SWE-bench Verified. https://openai.com/index/ introducing-swe-bench-verified/, 2024. Accessed 2026-01-29

  33. [33]

    Kemerlis, Simha Sethumadhavan, and An- gelos D

    Yossef Oren, Vasileios P. Kemerlis, Simha Sethumadhavan, and An- gelos D. Keromytis. The spy in the sandbox: Practical cache attacks in JavaScript and their implications. In Indrajit Ray, Ninghui Li, and Christopher Kruegel, editors,Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, October 12-16, 2015,...

  34. [34]

    Cache attacks and countermeasures: The case of AES

    Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and countermeasures: The case of AES. In David Pointcheval, editor, Topics in Cryptology - CT-RSA 2006, The Cryptographers’ Track at the RSA Conference 2006, San Jose, CA, USA, February 13-17, 2006, Proceedings, volume 3860 ofLecture Notes in Computer Science, pages 1–20. Springer, 2006

  35. [35]

    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han

    Benji Peng, Ziqian Bi, Qian Niu, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence KQ Yan, Yizhu Wen, Yichao Zhang, and Caitlyn Heqi Yin. Jailbreaking and mitigation of vulnerabilities in large language models.CoRR, abs/2410.15236, 2024

  36. [36]

    The Impact of AI on Developer Productivity: Evidence from GitHub Copilot

    Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. CoRR, abs/2302.06590, 2023

  37. [37]

    Prime+Scope: Overcoming the observer effect for high-precision cache contention attacks

    Antoon Purnal, Furkan Turan, and Ingrid Verbauwhede. Prime+Scope: Overcoming the observer effect for high-precision cache contention attacks. In Yongdae Kim, Jong Kim, Giovanni Vigna, and Elaine Shi, editors,CCS ’21: 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, November 15 - 19, 2021, pages 2906–292...

  38. [38]

    Chatdev: Communicative agents for software development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Li...

  39. [39]

    Qwen3-Coder: Agentic coding in the world

    Qwen Team. Qwen3-Coder: Agentic coding in the world. https://qwen. ai/blog?id=qwen3-coder, 2025. Accessed 2026-01-29

  40. [40]

    Augmenting code sequencing with retrieval-augmented generation (rag) for context-aware code synthesis

    S Jansi Rani, S G Deepika, D Devdharshini, and Harini Ravindran. Augmenting code sequencing with retrieval-augmented generation (rag) for context-aware code synthesis. In2024 First International Conference on Software, Systems and Information Technology (SSITCON), pages 1– 7, 2024. 15

  41. [41]

    Branch privilege injection: Compromising Spectre v2 hardware mitigations by exploit- ing branch predictor race conditions

    Sandro R ¨uegge, Johannes Wikner, and Kaveh Razavi. Branch privilege injection: Compromising Spectre v2 hardware mitigations by exploit- ing branch predictor race conditions. In Lujo Bauer and Giancarlo Pellegrino, editors,34th USENIX Security Symposium, USENIX Secu- rity 2025, Seattle, WA, USA, August 13-15, 2025, pages 2615–2631. USENIX Association, 2025

  42. [42]

    MalGEN: A Testbed for Modeling and Evaluating Malware Behaviors

    Bikash Saha and Sandeep Kumar Shukla. MalGEN: A generative agent framework for modeling malicious software in cybersecurity.CoRR, abs/2506.07586, 2025

  43. [43]

    CRAKEN: cybersecurity LLM agent with knowledge-based execution.CoRR, abs/2505.17107, 2025

    Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Cha- ran Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. CRAKEN: cybersecurity LLM agent with knowledge-based execution.CoRR, abs/2505.17107, 2025

  44. [44]

    Edward Suh, and Udit Gupta

    Michael Shen, Muhammad Umar, Kiwan Maeng, G. Edward Suh, and Udit Gupta. Towards understanding systems trade-offs in retrieval- augmented generation model inference.CoRR, abs/2412.11854, 2024

  45. [45]

    PentestAgent: Incor- porating LLM agents to automated penetration testing

    Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, and Wei Ruan. PentestAgent: Incor- porating LLM agents to automated penetration testing. InProceedings of the 20th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2025, Hanoi, Vietnam, August 25-29, 2025, pages 375–391. ACM, 2025

  46. [46]

    PoCGen: Generating proof-of-concept exploits for vulnerabilities in Npm packages.CoRR, abs/2506.04962, 2025

    Deniz Simsek, Aryaz Eghbali, and Michael Pradel. PoCGen: Generating proof-of-concept exploits for vulnerabilities in Npm packages.CoRR, abs/2506.04962, 2025

  47. [47]

    SMaCk: Efficient instruction cache attacks via self-modifying code conflicts

    Seonghun Son, Daniel Moghimi, and Berk G ¨ulmezoglu. SMaCk: Efficient instruction cache attacks via self-modifying code conflicts. In Lieven Eeckhout, Georgios Smaragdakis, Katai Liang, Adrian Sampson, Martha A. Kim, and Christopher J. Rossbach, editors,Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages...

  48. [48]

    D-CIPHER: Dynamic collaborative intel- ligent multi-agent system with planner and heterogeneous executors for offensive security.CoRR, abs/2502.10931, 2025

    Meet Udeshi, Minghao Shao, Haoran Xi, Nanda Rani, Kimberly Milner, Venkata Sai Charan Putrevu, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. D-CIPHER: Dynamic collaborative intel- ligent multi-agent system with planner and heterogeneous executors for offensive security.CoRR, abs...

  49. [49]

    From CVE entries to verifiable exploits: An automated multi- agent framework for reproducing CVEs.CoRR, abs/2509.01835, 2025

    Saad Ullah, Praneeth Balasubramanian, Wenbo Guo, Amanda Burnett, Hammond Pearce, Christopher Kruegel, Giovanni Vigna, and Gianluca Stringhini. From CVE entries to verifiable exploits: An automated multi- agent framework for reproducing CVEs.CoRR, abs/2509.01835, 2025

  50. [50]

    Sahraoui

    Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, and Houari A. Sahraoui. Exploring parameter-efficient fine-tuning techniques for code generation with large language models.ACM Trans. Softw. Eng. Methodol., 34(7):204:1–204:25, 2025

  51. [51]

    Training Solo: On the limita- tions of domain isolation against Spectre-v2 attacks

    Sander Wiebing and Cristiano Giuffrida. Training Solo: On the limita- tions of domain isolation against Spectre-v2 attacks. In Marina Blanton, William Enck, and Cristina Nita-Rotaru, editors,IEEE Symposium on Security and Privacy, SP 2025, San Francisco, CA, USA, May 12-15, 2025, pages 3599–3616. IEEE, 2025

  52. [52]

    RETBLEED: arbitrary speculative code execution with return instructions

    Johannes Wikner and Kaveh Razavi. RETBLEED: arbitrary speculative code execution with return instructions. In Kevin R. B. Butler and Kurt Thomas, editors,31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022, pages 3825–3842. USENIX Association, 2022

  53. [53]

    Autogen: Enabling next-gen LLM applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Has- san Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024

  54. [54]

    Hosein Yavarzadeh, Archit Agarwal, Max Christman, Christina Garman, Daniel Genkin, Andrew Kwong, Daniel Moghimi, Deian Stefan, Kazem Taram, and Dean M. Tullsen. Pathfinder: High-resolution control-flow attacks exploiting the conditional branch predictor. InProceedings of the 29th ACM International Conference on Architectural Support for Programming Langua...

  55. [55]

    TransAgent: Enhancing LLM-Based Code Translation via Fine-Grained Execution Alignment

    Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, and Yiling Lou. TRANSAGENT: an LLM-based multi-agent system for code translation.CoRR, abs/2409.19894, 2024

  56. [56]

    A systematic study on generating web vulnerability proof-of-concepts using large language models.CoRR, abs/2510.10148, 2025

    Mengyao Zhao, Kaixuan Li, Lyuye Zhang, Wenjing Dang, Chenggong Ding, Sen Chen, and Zheli Liu. A systematic study on generating web vulnerability proof-of-concepts using large language models.CoRR, abs/2510.10148, 2025

  57. [57]

    forgetting

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhang- hao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm- as-a-judge with mt-bench and chatbot arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural...

  58. [59]

    Craft Misprediction Conditions: Design [...]

  59. [60]

    This increases the likelihood of misprediction when the speculative path is executed

    Interleave with Legitimate Accesses: Mix legitimate accesses with speculative ones to train the branch predictor. This increases the likelihood of misprediction when the speculative path is executed

  60. [61]

    This should [...] Expert Feedback ADD the following details under the Implementation Guidelines: •Interleave safe and malicious index values within the same loop

    Ensure Speculative Execution: Use [...] Placement Guidance:Insert the controlled branch misprediction logic within the loop that prepares the speculative execution environment. This should [...] Expert Feedback ADD the following details under the Implementation Guidelines: •Interleave safe and malicious index values within the same loop. •Use branchless a...

  61. [62]

    Identify the Conditional Branch: Locate [...]

  62. [63]

    Interleave with Legitimate Accesses: Mix legitimate accesses with 17 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15 M16 M17 M18 M19 M20 0 50 100 100 60 10 0 20 40 20 20 90 10 80 90 80 90 0 80 30 20 100 100100 100 80 80 80 80 100 90 90 80 50 90 0 0 0 90 60 30 90 90 100 60 50 60 70 70 70 70 70 70 70 90 40 40 0 40 80 10 100 100 Metric Success Rate (%) Cl...

  63. [64]

    It’s a secret!!

    Ensure Speculative Execution: Use [...] Placement Guidance:Insert the controlled branch misprediction logic within the loop that prepares the speculative execution environment. This should [...] •Insert this interleaving logic before the index is used as the input to a victim function. •This step must not be inside the victim function, as the attacker sho...