uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs

Berk Gulmezoglu; Debopriya Roy Dipta; Eduard Marin; Thomas Eisenbarth; Thore Tiemann

arxiv: 2605.15503 · v1 · pith:R4YIVGVCnew · submitted 2026-05-15 · 💻 cs.CR

Pith reviewed 2026-05-19 15:42 UTC · model grok-4.3

classification 💻 cs.CR

keywords microarchitectural attacks · LLM code generation · Spectre attack · Prime+Probe · proof of concept · multi-agent framework · vulnerability assessment · cache attacks

The pith

uGen generates functionally correct microarchitectural attack PoCs by using multi-agent retrieval to fill LLM knowledge gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces uGen, a framework that automates the creation of proof-of-concept (PoC) code for microarchitectural attacks such as Spectre and Prime+Probe. It first studies systematically where current LLMs (GPT, Claude, and Qwen3) fail to generate correct attack primitives, then deploys a retrieval-augmented multi-agent system that injects the missing knowledge and produces code tailored to specific defender needs. This matters because manual PoC development is labor-intensive and the resulting code often lacks portability, which limits broad vulnerability assessment of processors. In the reported evaluation, uGen reaches up to 100% success for Spectre-v1 with Claude Sonnet-4 and 80% for Prime+Probe with Qwen3-Coder, and produces a successful PoC for $1.25 in under four minutes.

Core claim

uGen is the first LLM-driven framework for automated microarchitectural attack code generation. A systematic study reveals that LLMs frequently misgenerate or misplace critical attack primitives. Guided by this, uGen uses a retrieval-augmented, multi-agent design to inject missing domain knowledge and synthesize functionally correct PoCs for cache-based and speculative-execution attacks tailored to defender requirements across diverse microarchitectures and vulnerable functions.

What carries the argument

Retrieval-augmented multi-agent design for injecting missing attack primitives into LLMs to generate correct PoC code.
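
To make the idea of gap profiling followed by knowledge injection concrete, here is a minimal sketch. It is not the paper's implementation: the attribute names, the regular expressions, the knowledge-base notes, and the functions `profile_gaps` and `augment_prompt` are all hypothetical, chosen only to illustrate how missing primitives in a generated PoC could be detected and fed back into the next prompt.

```python
import re

# Hypothetical checklist of attack attributes for an x86 cache-timing PoC.
# The names and regexes are illustrative assumptions, not uGen's criteria.
REQUIRED_ATTRIBUTES = {
    "cache_flush": r"_mm_clflush\s*\(",
    "cycle_timer": r"__rdtscp?\s*\(",
    "serializing_fence": r"_mm_[lm]fence\s*\(",
}

# Hypothetical knowledge base: one short note per attribute, standing in
# for the RAG documents described in the paper.
KNOWLEDGE_BASE = {
    "cache_flush": "Flush the probe array from the cache before each measurement.",
    "cycle_timer": "Time each probe access with a cycle counter.",
    "serializing_fence": "Fence around the timed access so it is not reordered.",
}


def profile_gaps(poc_source: str) -> list[str]:
    """Return the attributes whose pattern does not appear in the source."""
    return [
        name
        for name, pattern in REQUIRED_ATTRIBUTES.items()
        if not re.search(pattern, poc_source)
    ]


def augment_prompt(task: str, gaps: list[str]) -> str:
    """Append the retrieved note for each missing attribute to the task prompt."""
    if not gaps:
        return task
    notes = "\n".join(f"- {name}: {KNOWLEDGE_BASE[name]}" for name in gaps)
    return f"{task}\n\nAddress these missing attributes:\n{notes}"


if __name__ == "__main__":
    # A draft that times and fences the access but never flushes the cache.
    draft = "uint64_t t0 = __rdtscp(&aux); _mm_lfence(); value = probe[i * 4096];"
    gaps = profile_gaps(draft)
    print(gaps)
    print(augment_prompt("Write a Spectre-v1 PoC in C.", gaps))
```

A pattern match only shows that a primitive is present, not that it is placed correctly. The paper reports misplacement as a separate failure mode, which a check like this would not catch.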

If this is right

Up to 100% success rate for generating Spectre-v1 PoCs using Claude Sonnet-4.
80% success rate for Prime+Probe PoCs using Qwen3-Coder.
Successful PoC generation at a cost of $1.25 in under four minutes.
Applicable across a diverse set of microarchitectures, vulnerable functions, and execution environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenders could use similar frameworks to quickly test new processor models for emerging attack vectors.
Extending the systematic gap analysis to additional attack types such as Rowhammer could broaden automated security testing.
The low cost suggests potential for integration into continuous integration pipelines for hardware security validation.

Load-bearing premise

The method of studying LLM gaps and using retrieval-augmented agents will consistently yield working and portable attack code for different processors and settings.

What would settle it

Generate PoC code with uGen, run it on a specific vulnerable CPU, and check whether it demonstrates the attack as expected.

Figures

Figures reproduced from arXiv: 2605.15503 by Berk Gulmezoglu, Debopriya Roy Dipta, Eduard Marin, Thomas Eisenbarth, Thore Tiemann.

**Figure 1.** Multi-agent architecture of uGen. The architecture consists of (S1) Knowledge Gap Profiler, (S2) RAG-Document Generator, (S3) RAG Validation & Refinement, and (S4) Deployment stage.

**Figure 2.** Multi-stage workflow of uGen: S1 (Knowledge Gap Profiler) identifies overlooked, misgenerated, or misplaced attack attributes in LLM-generated PoC code; S2 (RAG Document Generator) generates domain-specific details for attack attributes; S3 (RAG Validation & Refinement) validates and refines generated RAG documents; S4 (Deployment) refers to the final deployed stage for the end-user.

**Figure 3.** Knowledge Gap Profiler results showing the success rate of correctly implementing each metric in PoC codes across …

Abstract

Microarchitectural attacks continue to evolve, uncovering new exploitation vectors in modern processors. From a defensive perspective, assessing a system's susceptibility to such attacks remains challenging. Developing functional attack implementations is labor-intensive, requires deep microarchitectural expertise, and is highly sensitive to execution environments. Consequently, existing attacks often lack portability, limiting systematic and scalable vulnerability assessment. Recent advances in large language models (LLMs) suggest a potential avenue for lowering these barriers. However, it remains unclear whether LLMs can reliably generate functionally correct microarchitectural attack code suitable for rigorous vulnerability testing. In this work, we present uGen, the first LLM-driven framework for automated microarchitectural attack code generation. A key challenge we address is identifying attack-specific knowledge gaps in LLMs. Through a systematic study of state-of-the-art models (GPT, Claude, and Qwen3), we find that LLMs frequently misgenerate or misplace critical attack primitives. Guided by this analysis, uGen employs a retrieval-augmented, multi-agent design that injects missing domain knowledge to synthesize functionally correct microarchitectural attack PoCs tailored to defender requirements. We evaluate uGen on cache-based and speculative-execution attacks across a diverse set of microarchitectures, vulnerable functions, and LLM platforms. In the deployment stage, uGen achieves up to a 100% success rate for Spectre-v1 (Claude Sonnet-4) and 80% for Prime+Probe (Qwen3-Coder). Finally, we demonstrate that uGen can generate a successful PoC at a cost of $1.25 in under four minutes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

uGen is a practical multi-agent retrieval setup for LLM-generated microarchitectural PoCs that fills some knowledge gaps, but the reported success rates need clearer validation that the code actually triggers measurable side effects on hardware.

The paper's core contribution is a retrieval-augmented multi-agent framework that first maps where current LLMs fail on microarchitectural primitives like branch predictor mistraining or cache eviction sequences, then pulls in targeted knowledge to generate PoC code. They test this on Spectre-v1 and Prime+Probe across several models and microarchitectures, showing generation in under four minutes for roughly a dollar. That part is straightforward and useful for anyone who has spent time hand-writing these things and hitting environment-specific breakage.

The systematic gap analysis is the piece that feels new compared to generic code-generation work. It gives concrete examples of what the models misplace or omit, which grounds the design choices. The low cost and speed numbers are also a plus if they hold up.

The main soft spot is the success metric. The abstract claims up to 100% for Spectre-v1 and 80% for Prime+Probe, but it does not lay out how they confirmed functional correctness beyond compilation or basic execution. If success is only syntactic or runtime stability rather than verified timing differentials or actual data leakage on the target hardware, the numbers do not yet show that the generated code is ready for defender-style vulnerability assessment. The stress-test note captures this exactly. Without those measurements or controls for environmental sensitivity, the portability claim stays provisional.

This work is aimed at hardware security people who need faster ways to prototype attacks for testing mitigations. A reader who already works on side-channel evaluation will get the most out of the gap study and the agent orchestration details. The paper deserves a serious referee because the idea is timely, the empirical setup is described enough to critique, and the results could be strengthened with clearer verification steps rather than rejected outright.

Referee Report

2 major / 2 minor

Summary. The paper introduces uGen, a retrieval-augmented multi-agent LLM framework designed to generate functionally correct microarchitectural attack PoCs (e.g., Spectre-v1 and Prime+Probe) by first systematically identifying knowledge gaps in models like GPT, Claude, and Qwen3, then injecting missing attack primitives. It evaluates the system across multiple LLMs, microarchitectures, and vulnerable functions, reporting success rates up to 100% for Spectre-v1 (Claude Sonnet-4) and 80% for Prime+Probe (Qwen3-Coder) in a deployment stage, along with a demonstration of low-cost generation ($1.25 in under four minutes).

Significance. If the reported success rates reflect verified microarchitectural side effects (such as measurable speculative leakage or cache timing differentials) rather than syntactic or runtime validity alone, uGen could meaningfully reduce the expertise barrier for defenders performing portable vulnerability assessments. The systematic gap analysis and multi-agent design provide a concrete, reproducible method that could be extended to other side-channel attacks; the cross-model and cross-architecture evaluation adds practical value if the functional-correctness claims are substantiated.

major comments (2)

[Abstract] Abstract and Evaluation section: The success rates (100% for Spectre-v1, 80% for Prime+Probe) are presented without an explicit definition of the success metric or verification procedure. It is unclear whether a PoC is deemed successful upon compilation, execution without crash, or only after confirming actual attack effects (e.g., branch-predictor mistraining leakage exceeding noise for Spectre-v1 or statistically significant cache-eviction timing for Prime+Probe). This directly affects whether the central claim of producing 'functionally correct' PoCs for vulnerability assessment holds.
[Evaluation] Evaluation section: No details are supplied on controls for environmental sensitivity, hardware-specific timing noise, statistical significance testing, or how portability across microarchitectures was validated. Without these, the high success rates cannot be distinguished from basic executability, weakening the evidence that the retrieval-augmented multi-agent approach reliably produces usable attack code.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief table summarizing the exact attack primitives injected by the retrieval component for each evaluated attack.
[Section 3] Notation for agent roles and retrieval sources could be made more consistent across figures and text to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our paper. The comments raise valid points regarding the clarity of our success metrics and evaluation details. We address each major comment below and will revise the manuscript accordingly to improve transparency.

Referee: [Abstract] Abstract and Evaluation section: The success rates (100% for Spectre-v1, 80% for Prime+Probe) are presented without an explicit definition of the success metric or verification procedure. It is unclear whether a PoC is deemed successful upon compilation, execution without crash, or only after confirming actual attack effects (e.g., branch-predictor mistraining leakage exceeding noise for Spectre-v1 or statistically significant cache-eviction timing for Prime+Probe). This directly affects whether the central claim of producing 'functionally correct' PoCs for vulnerability assessment holds.

Authors: We acknowledge that the definition of success was not sufficiently explicit in the abstract and evaluation sections. In our experiments, a PoC is considered successful only if it produces the expected microarchitectural side effect, verified by measuring actual leakage or timing differentials that exceed noise thresholds, rather than just successful compilation or execution. We will revise the manuscript to include an explicit definition of the success metric and a detailed description of the verification procedure in the Evaluation section. revision: yes
Referee: [Evaluation] Evaluation section: No details are supplied on controls for environmental sensitivity, hardware-specific timing noise, statistical significance testing, or how portability across microarchitectures was validated. Without these, the high success rates cannot be distinguished from basic executability, weakening the evidence that the retrieval-augmented multi-agent approach reliably produces usable attack code.

Authors: We agree that more details on experimental controls are necessary to substantiate the claims. In the revised version, we will add information on how we controlled for environmental factors, such as using isolated execution environments, performing multiple runs to account for timing noise, applying statistical significance tests (e.g., t-tests on timing measurements), and validating portability by testing on multiple microarchitectures with documented hardware specifications and any necessary code adaptations. revision: yes
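
The rebuttal mentions t-tests on timing measurements. As a rough illustration of what such a check could look like, the sketch below computes Welch's t-statistic for two sets of access latencies and derives a one-sided p-value from a normal approximation, which is only reasonable for large samples. The function names, the sample latencies, and the 0.01 significance level are assumptions; a real evaluation would use latencies measured on the target hardware and an exact t-distribution, for example from a statistics library.

```python
import math
import statistics


def welch_t(fast: list[float], slow: list[float]) -> float:
    """Welch's t-statistic for the hypothesis mean(slow) > mean(fast)."""
    var_fast = statistics.variance(fast)
    var_slow = statistics.variance(slow)
    std_err = math.sqrt(var_fast / len(fast) + var_slow / len(slow))
    if std_err == 0:
        raise ValueError("both samples have zero variance")
    return (statistics.mean(slow) - statistics.mean(fast)) / std_err


def timing_gap_is_significant(
    fast: list[float], slow: list[float], alpha: float = 0.01
) -> bool:
    """One-sided test using a normal approximation to the t-distribution."""
    t_stat = welch_t(fast, slow)
    p_value = 1.0 - statistics.NormalDist().cdf(t_stat)
    return p_value < alpha


if __name__ == "__main__":
    # Made-up latencies in cycles, for illustration only.
    hits = [40, 42, 41, 39, 43, 40, 41, 42]
    misses = [200, 210, 190, 205, 198, 202, 207, 195]
    print(timing_gap_is_significant(hits, misses))
```

Welch's variant is used here because cache-hit and cache-miss latencies typically have different variances, so the equal-variance assumption of Student's t-test would not be a safe default.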

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation

The paper describes an empirical LLM-based framework for generating microarchitectural attack PoCs. Success rates (e.g., 100% for Spectre-v1) are measured by executing and testing generated code on target hardware for functional correctness and side-channel effects, rather than being defined internally or fitted to inputs. The systematic study of LLM gaps and the retrieval-augmented multi-agent design are presented as engineering choices validated externally through experiments across models and microarchitectures, with no equations, self-definitional loops, or load-bearing self-citations that reduce the central results to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full details unavailable. Core assumptions concern LLM limitations in attack primitives and the corrective power of retrieval augmentation.

axioms (2)

domain assumption LLMs frequently misgenerate or misplace critical attack primitives that can be systematically identified.
Presented as the key challenge guiding the framework design.
domain assumption Retrieval-augmented multi-agent design can supply the missing domain knowledge to produce correct PoCs.
Central mechanism of uGen.

pith-pipeline@v0.9.0 · 5839 in / 1345 out tokens · 64422 ms · 2026-05-19T15:42:18.678473+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "uGen employs a retrieval-augmented, multi-agent design that injects missing domain knowledge to synthesize functionally correct microarchitectural attack PoCs... achieves up to 100% success rate for Spectre-v1 (Claude Sonnet-4) and 80% for Prime+Probe"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Prime+Probe Attack... Spectre attacks exploit branch prediction units... controlled delay... cache hit/miss threshold"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 5 internal anchors

[1]

Introducing Claude 4

Anthropic. Introducing Claude 4. https://www.anthropic.com/news/ claude-4, 2025. Accessed 2026-01-29

work page 2025
[2]

Anthropic’s transparency hub

Anthropic. Anthropic’s transparency hub. https://www.anthropic.com/ transparency, 2026. Accessed 2026-01-29

work page 2026
[3]

Branch history injection: On the effectiveness of hardware mitigations against cross-privilege Spectre-v2 attacks

Enrico Barberis, Pietro Frigo, Marius Muench, Herbert Bos, and Cris- tiano Giuffrida. Branch history injection: On the effectiveness of hardware mitigations against cross-privilege Spectre-v2 attacks. In Kevin R. B. Butler and Kurt Thomas, editors,31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022, pages 971–988. USE...

work page 2022
[4]

SMoTherSpectre: Exploiting speculative execution through port contention

Atri Bhattacharyya, Alexandra Sandulescu, Matthias Neugschwandtner, Alessandro Sorniotti, Babak Falsafi, Mathias Payer, and Anil Kur- mus. SMoTherSpectre: Exploiting speculative execution through port contention. In Lorenzo Cavallaro, Johannes Kinder, XiaoFeng Wang, and Jonathan Katz, editors,Proceedings of the 2019 ACM SIGSAC Conference on Computer and C...

work page 2019
[5]

Rae, Erich Elsen, and Laurent Sifre

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean- Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Or...

work page 2022
[6]

Aim, wait, shoot: How the CacheSniper technique improves unprivileged cache attacks

Samira Briongos, Ida Bruhns, Pedro Malag ´on, Thomas Eisenbarth, and Jos´e Manuel Moya. Aim, wait, shoot: How the CacheSniper technique improves unprivileged cache attacks. InIEEE European Symposium on Security and Privacy, EuroS&P 2021, Vienna, Austria, September 6-10, 2021, pages 683–700. IEEE, 2021

work page 2021
[7]

Unprecedented code change automation: The fusion of LLMs and transformation by example.Proc

Malinda Dilhara, Abhiram Bellur, Timofey Bryksin, and Danny Dig. Unprecedented code change automation: The fusion of LLMs and transformation by example.Proc. ACM Softw. Eng., 1(FSE):631–653, 2024

work page 2024
[8]

De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding.arXiv preprint arXiv:2401.01701,

Aryaz Eghbali and Michael Pradel. De-Hallucinator: Iterative grounding for LLM-based code completion.CoRR, abs/2401.01701, 2024. 14

work page arXiv 2024
[9]

LLM Agents can Autonomously Exploit One-day Vulnerabilities

Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang. LLM agents can autonomously exploit one-day vulnerabilities.CoRR, abs/2404.08144, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. CoRR, abs/2312.10997, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Au- topentester: An LLM agent-based framework for automated pentesting

Yasod Ginige, Akila Niroshan, Sajal Jain, and Suranga Seneviratne. Au- topentester: An LLM agent-based framework for automated pentesting. In24th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2025, Guiyang, China, November 14-17, 2025, pages 163–174. IEEE, 2025

work page 2025
[12]

Flush+Flush: A fast and stealthy cache attack

Daniel Gruss, Cl ´ementine Maurice, Klaus Wagner, and Stefan Mangard. Flush+Flush: A fast and stealthy cache attack. In Juan Caballero, Urko Zurutuza, and Ricardo J. Rodr´ıguez, editors,Detection of Intrusions and Malware, and Vulnerability Assessment - 13th International Conference, DIMVA 2016, San Sebasti´an, Spain, July 7-8, 2016, Proceedings, volume 9...

work page 2016
[13]

Cache template attacks: Automating attacks on inclusive last-level caches

Daniel Gruss, Raphael Spreitzer, and Stefan Mangard. Cache template attacks: Automating attacks on inclusive last-level caches. In Jaeyeon Jung and Thorsten Holz, editors,24th USENIX Security Symposium, USENIX Security 15, Washington, D.C., USA, August 12-14, 2015, pages 897–912. USENIX Association, 2015

work page 2015
[14]

Cross-VM cache attacks on AES.IEEE Trans

Berk G ¨ulmezoglu, Mehmet Sinan Inci, Gorka Irazoqui, Thomas Eisen- barth, and Berk Sunar. Cross-VM cache attacks on AES.IEEE Trans. Multi Scale Comput. Syst., 2(3):211–222, 2016

work page 2016
[15]

Hennessy and David A

John L. Hennessy and David A. Patterson.Computer Architecture - A Quantitative Approach, 5th Edition. Morgan Kaufmann, 2012

work page 2012
[16]

Rain: Transiently leaking data from public clouds using old vulnerabilities

Math ´e Hertogh, Dave Quakkelaar, Thijs Raymakers, Mahesh Hari Sarma, Marius Muench, Herbert Bos, and Erik van der Kouwe. Rain: Transiently leaking data from public clouds using old vulnerabilities. In IEEE Symposium on Security and Privacy, SP 2026, San Francisco, CA, USA, May 18-21, 206. IEEE, 2026. To be published

work page 2026
[17]

Metagpt: Meta programming for A multi- agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and J ¨urgen Schmidhuber. Metagpt: Meta programming for A multi- agent collaborative framework. InThe Twelfth International Conference on Learning Representations, ...

work page 2024
[18]

OpenReview.net, 2024

work page 2024
[19]

speculative execution, variant 4: speculative store bypass

Jann Horn. speculative execution, variant 4: speculative store bypass. https://project-zero.issues.chromium.org/issues/42450580, 2018

work page arXiv 2018
[20]

InferFix: End-to-end program repair with LLMs

Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. InferFix: End-to-end program repair with LLMs. In Satish Chandra, Kelly Blincoe, and Paolo Tonella, editors,Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 20...

work page 2023
[21]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors,Proceedings of the 2020 Confer- ence on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 1...

work page 2020
[22]

Spectre mitigations in Microsoft’s C/C++ compiler

Paul Kocher. Spectre mitigations in Microsoft’s C/C++ compiler. https: //www.paulkocher.com/doc/MicrosoftCompilerSpectreMitigation.html,

work page
[23]

Spectre attacks: Exploit- ing speculative execution

Paul Kocher, Jann Horn, Anders Fogh, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. Spectre attacks: Exploit- ing speculative execution. In2019 IEEE Symposium on Security and Privacy, SP 2019, San Francisco, CA, USA, May 19-23, 2019, pages 1–19. IEEE, 2019

work page 2019
[24]

Khasawneh, Chengyu Song, and Nael B

Esmaeil Mohammadian Koruyeh, Khaled N. Khasawneh, Chengyu Song, and Nael B. Abu-Ghazaleh. Spectre returns! speculation attacks using the return stack buffer. In Christian Rossow and Yves Younan, editors,12th USENIX Workshop on Offensive Technologies, WOOT 2018, Baltimore, MD, USA, August 13-14, 2018. USENIX Association, 2018

work page 2018
[25]

Retrieval- augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K ¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, Sebastian Riedel, and Douwe Kiela. Retrieval- augmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin...

work page 2020
[26]

Fangfei Liu, Yuval Yarom, Qian Ge, Gernot Heiser, and Ruby B. Lee. Last-level cache side-channel attacks are practical. In2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015, pages 605–622. IEEE Computer Society, 2015

work page 2015
[27]

ret2spec: Speculative exe- cution using return stack buffers

Giorgi Maisuradze and Christian Rossow. ret2spec: Speculative exe- cution using return stack buffers. In David Lie, Mohammad Mannan, Michael Backes, and XiaoFeng Wang, editors,Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS 2018, Toronto, ON, Canada, October 15-19, 2018, pages 2109–

work page 2018
[28]

Debugging with open-source large language models: An evaluation

Yacine Majdoub and Eya Ben Charrada. Debugging with open-source large language models: An evaluation. In Xavier Franch, Maya Daneva, Silverio Mart´ınez-Fern´andez, and Luigi Quaranta, editors,Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2024, Barcelona, Spain, October 24-25, 2024, pages 5...

work page 2024
[29]

AutoPen: Towards autonomous penetration testing using LLM-powered agents

Jiahao Mei, Shuangwu Chen, Yuanyi Ma, and Huizi Song. AutoPen: Towards autonomous penetration testing using LLM-powered agents. In Proceedings of the 9th International Conference on Computer Science and Application Engineering, CSAE 2025, Shanghai, China October 19- 21, 2025, pages 1–6. ACM, 2025

work page 2025
[30]

Hacksynth: Llm agent and evaluation framework for autonomous penetration testing.arXiv preprint arXiv:2412.01778, 2024

Lajos Muzsai, David Imolai, and Andr ´as Luk ´acs. HackSynth: LLM agent and evaluation framework for autonomous penetration testing. CoRR, abs/2412.01778, 2024

work page arXiv 2024
[31]

GPT-4o system card

OpenAI. GPT-4o system card. https://cdn.openai.com/ gpt-4o-system-card.pdf, 2024. Accessed 2026-01-29

work page 2024
[32]

Introducing SWE-bench Verified

OpenAI. Introducing SWE-bench Verified. https://openai.com/index/ introducing-swe-bench-verified/, 2024. Accessed 2026-01-29

work page 2024
[33]

Kemerlis, Simha Sethumadhavan, and An- gelos D

Yossef Oren, Vasileios P. Kemerlis, Simha Sethumadhavan, and An- gelos D. Keromytis. The spy in the sandbox: Practical cache attacks in JavaScript and their implications. In Indrajit Ray, Ninghui Li, and Christopher Kruegel, editors,Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, CO, USA, October 12-16, 2015,...

work page 2015
[34]

Cache attacks and countermeasures: The case of AES

Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and countermeasures: The case of AES. In David Pointcheval, editor, Topics in Cryptology - CT-RSA 2006, The Cryptographers’ Track at the RSA Conference 2006, San Jose, CA, USA, February 13-17, 2006, Proceedings, volume 3860 ofLecture Notes in Computer Science, pages 1–20. Springer, 2006

work page 2006
[35]

Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han

Benji Peng, Ziqian Bi, Qian Niu, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence KQ Yan, Yizhu Wen, Yichao Zhang, and Caitlyn Heqi Yin. Jailbreaking and mitigation of vulnerabilities in large language models.CoRR, abs/2410.15236, 2024

work page arXiv 2024
[36]

The Impact of AI on Developer Productivity: Evidence from GitHub Copilot

Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub Copilot. CoRR, abs/2302.06590, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Prime+Scope: Overcoming the observer effect for high-precision cache contention attacks

Antoon Purnal, Furkan Turan, and Ingrid Verbauwhede. Prime+Scope: Overcoming the observer effect for high-precision cache contention attacks. In Yongdae Kim, Jong Kim, Giovanni Vigna, and Elaine Shi, editors,CCS ’21: 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, Republic of Korea, November 15 - 19, 2021, pages 2906–292...

work page 2021
[38]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Li...

work page 2024
[39]

Qwen3-Coder: Agentic coding in the world

Qwen Team. Qwen3-Coder: Agentic coding in the world. https://qwen. ai/blog?id=qwen3-coder, 2025. Accessed 2026-01-29

work page 2025
[40]

Augmenting code sequencing with retrieval-augmented generation (rag) for context-aware code synthesis

S Jansi Rani, S G Deepika, D Devdharshini, and Harini Ravindran. Augmenting code sequencing with retrieval-augmented generation (rag) for context-aware code synthesis. In2024 First International Conference on Software, Systems and Information Technology (SSITCON), pages 1– 7, 2024. 15

work page 2024
[41]

Branch privilege injection: Compromising Spectre v2 hardware mitigations by exploit- ing branch predictor race conditions

Sandro R ¨uegge, Johannes Wikner, and Kaveh Razavi. Branch privilege injection: Compromising Spectre v2 hardware mitigations by exploit- ing branch predictor race conditions. In Lujo Bauer and Giancarlo Pellegrino, editors,34th USENIX Security Symposium, USENIX Secu- rity 2025, Seattle, WA, USA, August 13-15, 2025, pages 2615–2631. USENIX Association, 2025

work page 2025
[42]

MalGEN: A Testbed for Modeling and Evaluating Malware Behaviors

Bikash Saha and Sandeep Kumar Shukla. MalGEN: A generative agent framework for modeling malicious software in cybersecurity.CoRR, abs/2506.07586, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

CRAKEN: cybersecurity LLM agent with knowledge-based execution.CoRR, abs/2505.17107, 2025

Minghao Shao, Haoran Xi, Nanda Rani, Meet Udeshi, Venkata Sai Cha- ran Putrevu, Kimberly Milner, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. CRAKEN: cybersecurity LLM agent with knowledge-based execution.CoRR, abs/2505.17107, 2025

[44] Michael Shen, Muhammad Umar, Kiwan Maeng, G. Edward Suh, and Udit Gupta. Towards understanding systems trade-offs in retrieval-augmented generation model inference. CoRR, abs/2412.11854, 2024.

[45] Xiangmin Shen, Lingzhi Wang, Zhenyuan Li, Yan Chen, Wencheng Zhao, Dawei Sun, Jiashui Wang, and Wei Ruan. PentestAgent: Incorporating LLM agents to automated penetration testing. In Proceedings of the 20th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2025, Hanoi, Vietnam, August 25-29, 2025, pages 375–391. ACM, 2025.

[46] Deniz Simsek, Aryaz Eghbali, and Michael Pradel. PoCGen: Generating proof-of-concept exploits for vulnerabilities in Npm packages. CoRR, abs/2506.04962, 2025.

[47] Seonghun Son, Daniel Moghimi, and Berk Gülmezoglu. SMaCk: Efficient instruction cache attacks via self-modifying code conflicts. In Lieven Eeckhout, Georgios Smaragdakis, Katai Liang, Adrian Sampson, Martha A. Kim, and Christopher J. Rossbach, editors, Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages...

[48] Meet Udeshi, Minghao Shao, Haoran Xi, Nanda Rani, Kimberly Milner, Venkata Sai Charan Putrevu, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, and Muhammad Shafique. D-CIPHER: Dynamic collaborative intelligent multi-agent system with planner and heterogeneous executors for offensive security. CoRR, abs/2502.10931, 2025.

[49] Saad Ullah, Praneeth Balasubramanian, Wenbo Guo, Amanda Burnett, Hammond Pearce, Christopher Kruegel, Giovanni Vigna, and Gianluca Stringhini. From CVE entries to verifiable exploits: An automated multi-agent framework for reproducing CVEs. CoRR, abs/2509.01835, 2025.

[50] Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, and Houari A. Sahraoui. Exploring parameter-efficient fine-tuning techniques for code generation with large language models. ACM Trans. Softw. Eng. Methodol., 34(7):204:1–204:25, 2025.

[51] Sander Wiebing and Cristiano Giuffrida. Training Solo: On the limitations of domain isolation against Spectre-v2 attacks. In Marina Blanton, William Enck, and Cristina Nita-Rotaru, editors, IEEE Symposium on Security and Privacy, SP 2025, San Francisco, CA, USA, May 12-15, 2025, pages 3599–3616. IEEE, 2025.

[52] Johannes Wikner and Kaveh Razavi. RETBLEED: arbitrary speculative code execution with return instructions. In Kevin R. B. Butler and Kurt Thomas, editors, 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022, pages 3825–3842. USENIX Association, 2022.

[53] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In First Conference on Language Modeling, 2024.

[54] Hosein Yavarzadeh, Archit Agarwal, Max Christman, Christina Garman, Daniel Genkin, Andrew Kwong, Daniel Moghimi, Deian Stefan, Kazem Taram, and Dean M. Tullsen. Pathfinder: High-resolution control-flow attacks exploiting the conditional branch predictor. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Langua...

[55] Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, and Yiling Lou. TRANSAGENT: an LLM-based multi-agent system for code translation. CoRR, abs/2409.19894, 2024.

[56] Mengyao Zhao, Kaixuan Li, Lyuye Zhang, Wenjing Dang, Chenggong Ding, Sen Chen, and Zheli Liu. A systematic study on generating web vulnerability proof-of-concepts using large language models. CoRR, abs/2510.10148, 2025.

[57] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural...


