FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments
Pith reviewed 2026-05-13 01:09 UTC · model grok-4.3
The pith
Graph models recover malicious signals in LLM fragments spread across separate sessions
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FragBench keeps the full attack trail from real campaigns, including the multi-fragment kill chain, per-fragment safety verdicts, sandbox execution traces, and matched benign sessions. The single-turn judge is defeated by construction, yet modeling the interaction graph among fragments from the same user allows both GNNs and classical ML methods to recover the cross-session malicious feature with high accuracy. This establishes that defending against fragmented LLM misuse requires graph-based reasoning over user-level interactions rather than evaluation of isolated prompts.
What carries the argument
The cross-session interaction graph, which links fragments from the same campaign into a single user trail so that collective malicious structure becomes visible even when no individual node triggers a safety flag.
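A minimal sketch of the kind of user-level interaction graph described here: fragments become nodes, and consecutive fragments in the same user's trail are linked. The fragment schema (id, user_id, ts, embedding) is an illustrative assumption, not FragBench's actual format.

```python
import networkx as nx

def build_user_graph(fragments):
    """fragments: dicts with 'id', 'user_id', 'ts', 'embedding' (assumed schema)."""
    g = nx.Graph()
    by_user = {}
    for f in fragments:
        g.add_node(f["id"], user=f["user_id"], x=f["embedding"])
        by_user.setdefault(f["user_id"], []).append(f)
    # Chain consecutive fragments in each user's trail, so collective
    # malicious structure is visible even when no single node is flagged.
    for trail in by_user.values():
        trail.sort(key=lambda f: f["ts"])
        for a, b in zip(trail, trail[1:]):
            g.add_edge(a["id"], b["id"])
    return g
```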
If this is right
- Single-turn safety judges are insufficient by design for detecting split malicious goals.
- Graph-based detectors on user interaction data achieve F1 scores of 0.88-0.96 where isolated checks fail (a minimal detector sketch follows this list).
- Effective defense requires shifting evaluation from individual prompts to multi-session interaction graphs.
- The released rewriter, sandbox harness, and detector code enable systematic testing of new graph defenses.
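For concreteness, a hypothetical instance of such a graph detector: a two-layer GCN whose pooled trail embedding drives the verdict, so no single prompt decides the outcome. Layer sizes and the torch_geometric framing are assumptions; the released detectors may differ.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class TrailDetector(torch.nn.Module):
    """Binary benign/malicious classifier over a user-trail fragment graph."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 2)

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x, edge_index))      # mix neighbor fragments
        h = F.relu(self.conv2(h, edge_index))      # two hops of trail context
        return self.head(global_mean_pool(h, batch))  # one verdict per trail
```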
Where Pith is reading between the lines
- Production LLM systems may need persistent user-session graphs to catch attacks that unfold over hours or days.
- The same interaction-graph approach could be tested on other covert multi-step misuse patterns beyond cyber campaigns.
- Adding execution-trace features to the graph nodes might allow earlier blocking before the full kill chain completes.
Load-bearing premise
Fragments from real campaigns can be rewritten to each pass a single-turn safety judge on its own while the combination across sessions remains malicious, and the chosen 24 campaigns plus benign matches form a representative test distribution.
What would settle it
Running the released GNN and classical detectors on a fresh collection of fragmented attack campaigns drawn from sources outside the original 24, and observing aggregate event-level F1 fall below 0.7, would falsify the claim that cross-session graph modeling reliably recovers the hidden malicious signal.
Original abstract
An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by construction, but four GNN variants and three classical-ML baselines all recover the cross-session feature, reaching aggregate event-level F1 = 0.88-0.96. Defending against fragmented LLM misuse therefore requires modeling the cross-session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at https://github.com/LidaSafety/fragbench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FragBench, a benchmark constructed from 24 real-world cyber-incident campaigns that preserves full attack trails including multi-fragment kill chains, per-fragment safety-judge verdicts, sandboxed execution traces, and matched benign sessions. It defines two paired tasks: FragBench Attack, an adversarial rewriter that hardens individual fragments to evade a single-turn safety judge while keeping the combination malicious, and FragBench Defense, a graph-based detector (four GNN variants plus three classical-ML baselines) trained on the resulting cross-session interaction graphs. The single-turn judge performs near chance by construction on the released corpus, yet the detectors recover the cross-session malicious signal and achieve aggregate event-level F1 scores of 0.88-0.96. The work concludes that defending against fragmented LLM misuse requires modeling the cross-session interaction graph rather than isolated prompts, and releases the generator, rewriter, sandbox harness, and detector.
Significance. If the benchmark construction avoids systematic artifacts and the high F1 scores reflect genuine cross-session features rather than rewriter-induced patterns, the result is significant: it identifies a previously unaddressed threat model in LLM safety (multi-session fragmentation with no shared context) and demonstrates that graph-based models can detect it where single-turn or single-session approaches fail. The explicit release of code, rewriter, and harness is a clear strength that enables reproducibility and follow-on work. The empirical demonstration across multiple model families strengthens the case that cross-session modeling is necessary.
major comments (3)
- [§4, §5.1] Experimental setup and results: the reported aggregate event-level F1 range of 0.88-0.96 is presented without error bars, statistical significance tests, or explicit description of how the 24 campaigns were partitioned into train/validation/test sets. This makes it impossible to determine whether the cross-session recovery generalizes or reflects campaign-specific selection effects.
- [§3.2] FragBench Attack rewriter: no ablation is reported on rewriter variants, alternative single-turn judges, or feature-importance analysis that isolates inter-fragment relational signals from intra-fragment lexical or structural artifacts introduced by the rewriter itself. Without such controls, it remains possible that both GNNs and classical baselines exploit consistent rewriting patterns rather than a general cross-session malicious feature.
- [§3.1] Campaign selection: the manuscript states that the 24 campaigns are drawn from real-world cyber-incident trails but provides no explicit criteria for inclusion, diversity metrics, or validation that the matched benign sessions form an unbiased distribution. This directly affects the weakest assumption underlying the central claim that the benchmark is representative.
minor comments (2)
- [abstract, §5] The abstract and §5 refer to 'event-level F1' without a precise definition or pseudocode for how events are aggregated from fragment-level predictions (one plausible aggregation is sketched after this list).
- [§5] Table captions and axis labels in the results figures should explicitly state the number of campaigns and sessions used for each reported metric.
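To ground the first minor comment, one plausible reading of 'event-level F1', assuming an any-fragment-flags-the-event aggregation rule; the paper may define events differently.

```python
from collections import defaultdict
from sklearn.metrics import f1_score

def event_level_f1(frag_event_ids, frag_preds, event_labels):
    """Aggregate fragment-level verdicts to events, then score at event level.
    event_labels: dict mapping event id -> 0/1 ground truth."""
    flagged = defaultdict(bool)
    for eid, pred in zip(frag_event_ids, frag_preds):
        flagged[eid] |= bool(pred)  # any flagged fragment flags the event
    events = sorted(event_labels)
    return f1_score([event_labels[e] for e in events],
                    [int(flagged[e]) for e in events])
```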
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, agree where revisions are needed to improve clarity and rigor, and commit to incorporating these changes in the revised manuscript.
Point-by-point responses
- Referee: [§4, §5.1] Experimental setup and results: the reported aggregate event-level F1 range of 0.88-0.96 is presented without error bars, statistical significance tests, or explicit description of how the 24 campaigns were partitioned into train/validation/test sets. This makes it impossible to determine whether the cross-session recovery generalizes or reflects campaign-specific selection effects.
Authors: We agree that these details are essential for evaluating generalization and were insufficiently reported. In the revised manuscript we will add error bars computed via bootstrapping over campaigns and 5-fold cross-validation, include statistical significance tests (e.g., McNemar’s test and paired t-tests) comparing GNN variants against classical baselines, and explicitly describe the partitioning (campaign-level splits with no session leakage across folds). We will also report per-campaign F1 scores to demonstrate consistency rather than aggregate-only results. revision: yes
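A sketch of the promised protocol under stated assumptions (a generic fit_and_score callable and per-campaign F1 inputs, both hypothetical): GroupKFold keyed on campaign ids enforces campaign-level splits with no session leakage, and a percentile bootstrap over campaigns yields the error bars.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def campaign_cv_scores(X, y, campaign_ids, fit_and_score, n_splits=5):
    """Split at the campaign level so no campaign spans train and test."""
    gkf = GroupKFold(n_splits=n_splits)
    return [fit_and_score(X[tr], y[tr], X[te], y[te])
            for tr, te in gkf.split(X, y, groups=campaign_ids)]

def bootstrap_ci(per_campaign_f1, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for mean per-campaign F1."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_campaign_f1)
    boots = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])
```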
- Referee: [§3.2] FragBench Attack rewriter: no ablation is reported on rewriter variants, alternative single-turn judges, or feature-importance analysis that isolates inter-fragment relational signals from intra-fragment lexical or structural artifacts introduced by the rewriter itself. Without such controls, it remains possible that both GNNs and classical baselines exploit consistent rewriting patterns rather than a general cross-session malicious feature.
Authors: This is a valid concern about potential confounding. We will add the requested controls in the revision: ablations across rewriter variants (different adversarial objectives and prompt templates), evaluation using alternative single-turn safety judges, and feature-importance analysis (GNN attention weights and permutation importance) that quantifies the contribution of inter-fragment edges versus intra-fragment node features. These additions will help isolate the cross-session relational signal from rewriter-induced artifacts. revision: yes
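A sketch of one such control, assuming a torch_geometric-style graph with x and edge_index fields and a hypothetical evaluate() helper: if F1 collapses when edges are rewired but survives node-feature shuffling, the detector is relying on cross-session structure rather than rewriter fingerprints.

```python
import copy
import torch

def structure_vs_content_importance(model, graph, evaluate, seed=0):
    """Compare F1 drops from destroying relational structure vs node content.
    graph: torch_geometric-style Data with .x and .edge_index (assumed);
    evaluate(model, graph) -> F1 is a hypothetical helper."""
    g = torch.Generator().manual_seed(seed)
    base = evaluate(model, graph)

    rewired = copy.deepcopy(graph)  # destroy inter-fragment relations
    perm = torch.randperm(rewired.edge_index.size(1), generator=g)
    rewired.edge_index[1] = rewired.edge_index[1][perm]  # shuffle edge targets

    shuffled = copy.deepcopy(graph)  # break intra-fragment feature alignment
    shuffled.x = shuffled.x[torch.randperm(shuffled.x.size(0), generator=g)]

    return {"baseline": base,
            "edges_rewired": evaluate(model, rewired),
            "features_shuffled": evaluate(model, shuffled)}
```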
- Referee: [§3.1] Campaign selection: the manuscript states that the 24 campaigns are drawn from real-world cyber-incident trails but provides no explicit criteria for inclusion, diversity metrics, or validation that the matched benign sessions form an unbiased distribution. This directly affects the weakest assumption underlying the central claim that the benchmark is representative.
Authors: We acknowledge the need for greater transparency on benchmark construction. In the revised version we will specify the inclusion criteria (sourcing from public incident reports with requirements for multi-fragment kill chains), report diversity metrics (distribution over attack categories, temporal span, and target platforms), and detail the benign-session matching procedure together with validation steps (e.g., Kolmogorov-Smirnov tests on session-length and behavioral features) to support the claim of an unbiased distribution. Any remaining limitations in representativeness will be discussed explicitly. revision: yes
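A sketch of the proposed matching validation, assuming simple per-session covariates such as session_length (illustrative, not FragBench's feature set). A large p-value only supports matched marginals on a feature; it does not prove the benign set is unbiased overall.

```python
from scipy.stats import ks_2samp

def matching_report(benign_sessions, malicious_sessions,
                    features=("session_length",)):
    """Two-sample KS test per covariate; sessions are dicts of features
    (assumed schema)."""
    report = {}
    for feat in features:
        a = [s[feat] for s in benign_sessions]
        b = [s[feat] for s in malicious_sessions]
        stat, p = ks_2samp(a, b)
        report[feat] = {"ks_stat": stat, "p_value": p}
    return report
```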
Circularity Check
No circularity: empirical benchmark evaluation on held-out data.
full rationale
The paper constructs FragBench from 24 real-world campaigns, uses its own rewriter to produce fragments that evade single-turn judges by design, and reports empirical F1 scores (0.88-0.96) for GNN and classical-ML detectors on the resulting cross-session graphs. This is a direct measurement on a held-out test set rather than any derivation, equation, or prediction that reduces to fitted inputs or self-citations by construction. No load-bearing steps rely on self-citation for uniqueness, smuggle ansatzes, or rename known results; the 'by construction' clause applies only to the judge's per-fragment performance and does not force the cross-session detection outcome. The evaluation is falsifiable against external data and remains self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Fragments from the 24 campaigns can be rewritten to individually evade a single-turn safety judge while preserving the overall malicious goal when combined.
- domain assumption: The interaction graph constructed from session fragments contains detectable cross-session features that distinguish malicious from benign trails.