FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments
Pith reviewed 2026-05-13 01:09 UTC · model grok-4.3
The pith
Graph models recover malicious signals in LLM fragments spread across separate sessions
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FragBench keeps the full attack trail from real campaigns, including the multi-fragment kill chain, per-fragment safety verdicts, sandbox execution traces, and matched benign sessions. The single-turn judge is defeated by construction, yet modeling the interaction graph among fragments from the same user allows both GNNs and classical ML methods to recover the cross-session malicious feature with high accuracy. This establishes that defending against fragmented LLM misuse requires graph-based reasoning over user-level interactions rather than evaluation of isolated prompts.
What carries the argument
The cross-session interaction graph, which links fragments from the same campaign into a single user trail so that collective malicious structure becomes visible even when no individual node triggers a safety flag.
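A minimal sketch of the kind of user-level interaction graph described here: fragments become nodes, and consecutive fragments in the same user's trail are linked. The fragment schema (id, user_id, ts, embedding) is an illustrative assumption, not FragBench's actual format.

```python
import networkx as nx

def build_user_graph(fragments):
    """fragments: dicts with 'id', 'user_id', 'ts', 'embedding' (assumed schema)."""
    g = nx.Graph()
    by_user = {}
    for f in fragments:
        g.add_node(f["id"], user=f["user_id"], x=f["embedding"])
        by_user.setdefault(f["user_id"], []).append(f)
    # Chain consecutive fragments in each user's trail, so collective
    # malicious structure is visible even when no single node is flagged.
    for trail in by_user.values():
        trail.sort(key=lambda f: f["ts"])
        for a, b in zip(trail, trail[1:]):
            g.add_edge(a["id"], b["id"])
    return g
```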
If this is right
- Single-turn safety judges are insufficient by design for detecting split malicious goals.
- Graph-based detectors on user interaction data achieve F1 scores of 0.88-0.96 where isolated checks fail (a minimal detector sketch follows this list).
- Effective defense requires shifting evaluation from individual prompts to multi-session interaction graphs.
- The released rewriter, sandbox harness, and detector code enable systematic testing of new graph defenses.
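For concreteness, a hypothetical instance of such a graph detector: a two-layer GCN whose pooled trail embedding drives the verdict, so no single prompt decides the outcome. Layer sizes and the torch_geometric framing are assumptions; the released detectors may differ.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class TrailDetector(torch.nn.Module):
    """Binary benign/malicious classifier over a user-trail fragment graph."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 2)

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x, edge_index))      # mix neighbor fragments
        h = F.relu(self.conv2(h, edge_index))      # two hops of trail context
        return self.head(global_mean_pool(h, batch))  # one verdict per trail
```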
Where Pith is reading between the lines
- Production LLM systems may need persistent user-session graphs to catch attacks that unfold over hours or days.
- The same interaction-graph approach could be tested on other covert multi-step misuse patterns beyond cyber campaigns.
- Adding execution-trace features to the graph nodes might allow earlier blocking before the full kill chain completes.
Load-bearing premise
Fragments from real campaigns can be rewritten to each pass a single-turn safety judge on its own while the combination across sessions remains malicious, and the chosen 24 campaigns plus benign matches form a representative test distribution.
What would settle it
Running the released GNN and classical detectors on a fresh collection of fragmented attack campaigns drawn from sources outside the original 24, and observing aggregate event-level F1 fall below 0.7, would falsify the claim that cross-session graph modeling reliably recovers the hidden malicious signal.
Original abstract
An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by construction, but four GNN variants and three classical-ML baselines all recover the cross-session feature, reaching aggregate event-level F1 = 0.88-0.96. Defending against fragmented LLM misuse therefore requires modeling the cross-session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at https://github.com/LidaSafety/fragbench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FragBench, a benchmark constructed from 24 real-world cyber-incident campaigns that preserves full attack trails including multi-fragment kill chains, per-fragment safety-judge verdicts, sandboxed execution traces, and matched benign sessions. It defines two paired tasks: FragBench Attack, an adversarial rewriter that hardens individual fragments to evade a single-turn safety judge while keeping the combination malicious, and FragBench Defense, a graph-based detector (four GNN variants plus three classical-ML baselines) trained on the resulting cross-session interaction graphs. The single-turn judge performs near chance by construction on the released corpus, yet the detectors recover the cross-session malicious signal and achieve aggregate event-level F1 scores of 0.88-0.96. The work concludes that defending against fragmented LLM misuse requires modeling the cross-session interaction graph rather than isolated prompts, and releases the generator, rewriter, sandbox harness, and detector.
Significance. If the benchmark construction avoids systematic artifacts and the high F1 scores reflect genuine cross-session features rather than rewriter-induced patterns, the result is significant: it identifies a previously unaddressed threat model in LLM safety (multi-session fragmentation with no shared context) and demonstrates that graph-based models can detect it where single-turn or single-session approaches fail. The explicit release of code, rewriter, and harness is a clear strength that enables reproducibility and follow-on work. The empirical demonstration across multiple model families strengthens the case that cross-session modeling is necessary.
major comments (3)
- [§4, §5.1] Experimental setup and results: the reported aggregate event-level F1 range of 0.88-0.96 is presented without error bars, statistical significance tests, or explicit description of how the 24 campaigns were partitioned into train/validation/test sets. This makes it impossible to determine whether the cross-session recovery generalizes or reflects campaign-specific selection effects.
- [§3.2] FragBench Attack rewriter: no ablation is reported on rewriter variants, alternative single-turn judges, or feature-importance analysis that isolates inter-fragment relational signals from intra-fragment lexical or structural artifacts introduced by the rewriter itself. Without such controls, it remains possible that both GNNs and classical baselines exploit consistent rewriting patterns rather than a general cross-session malicious feature.
- [§3.1] Campaign selection: the manuscript states that the 24 campaigns are drawn from real-world cyber-incident trails but provides no explicit criteria for inclusion, diversity metrics, or validation that the matched benign sessions form an unbiased distribution. This directly affects the weakest assumption underlying the central claim that the benchmark is representative.
minor comments (2)
- [abstract, §5] The abstract and §5 refer to 'event-level F1' without a precise definition or pseudocode for how events are aggregated from fragment-level predictions (one plausible aggregation is sketched after this list).
- [§5] Table captions and axis labels in the results figures should explicitly state the number of campaigns and sessions used for each reported metric.
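To ground the first minor comment, one plausible reading of 'event-level F1', assuming an any-fragment-flags-the-event aggregation rule; the paper may define events differently.

```python
from collections import defaultdict
from sklearn.metrics import f1_score

def event_level_f1(frag_event_ids, frag_preds, event_labels):
    """Aggregate fragment-level verdicts to events, then score at event level.
    event_labels: dict mapping event id -> 0/1 ground truth."""
    flagged = defaultdict(bool)
    for eid, pred in zip(frag_event_ids, frag_preds):
        flagged[eid] |= bool(pred)  # any flagged fragment flags the event
    events = sorted(event_labels)
    return f1_score([event_labels[e] for e in events],
                    [int(flagged[e]) for e in events])
```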
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, agree where revisions are needed to improve clarity and rigor, and commit to incorporating these changes in the revised manuscript.
Point-by-point responses
- Referee: [§4, §5.1] Experimental setup and results: the reported aggregate event-level F1 range of 0.88-0.96 is presented without error bars, statistical significance tests, or explicit description of how the 24 campaigns were partitioned into train/validation/test sets. This makes it impossible to determine whether the cross-session recovery generalizes or reflects campaign-specific selection effects.
Authors: We agree that these details are essential for evaluating generalization and were insufficiently reported. In the revised manuscript we will add error bars computed via bootstrapping over campaigns and 5-fold cross-validation, include statistical significance tests (e.g., McNemar’s test and paired t-tests) comparing GNN variants against classical baselines, and explicitly describe the partitioning (campaign-level splits with no session leakage across folds). We will also report per-campaign F1 scores to demonstrate consistency rather than aggregate-only results. revision: yes
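A sketch of the promised protocol under stated assumptions (a generic fit_and_score callable and per-campaign F1 inputs, both hypothetical): GroupKFold keyed on campaign ids enforces campaign-level splits with no session leakage, and a percentile bootstrap over campaigns yields the error bars.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def campaign_cv_scores(X, y, campaign_ids, fit_and_score, n_splits=5):
    """Split at the campaign level so no campaign spans train and test."""
    gkf = GroupKFold(n_splits=n_splits)
    return [fit_and_score(X[tr], y[tr], X[te], y[te])
            for tr, te in gkf.split(X, y, groups=campaign_ids)]

def bootstrap_ci(per_campaign_f1, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for mean per-campaign F1."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_campaign_f1)
    boots = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return np.quantile(boots, [alpha / 2, 1 - alpha / 2])
```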
- Referee: [§3.2] FragBench Attack rewriter: no ablation is reported on rewriter variants, alternative single-turn judges, or feature-importance analysis that isolates inter-fragment relational signals from intra-fragment lexical or structural artifacts introduced by the rewriter itself. Without such controls, it remains possible that both GNNs and classical baselines exploit consistent rewriting patterns rather than a general cross-session malicious feature.
Authors: This is a valid concern about potential confounding. We will add the requested controls in the revision: ablations across rewriter variants (different adversarial objectives and prompt templates), evaluation using alternative single-turn safety judges, and feature-importance analysis (GNN attention weights and permutation importance) that quantifies the contribution of inter-fragment edges versus intra-fragment node features. These additions will help isolate the cross-session relational signal from rewriter-induced artifacts. revision: yes
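A sketch of one such control, assuming a torch_geometric-style graph with x and edge_index fields and a hypothetical evaluate() helper: if F1 collapses when edges are rewired but survives node-feature shuffling, the detector is relying on cross-session structure rather than rewriter fingerprints.

```python
import copy
import torch

def structure_vs_content_importance(model, graph, evaluate, seed=0):
    """Compare F1 drops from destroying relational structure vs node content.
    graph: torch_geometric-style Data with .x and .edge_index (assumed);
    evaluate(model, graph) -> F1 is a hypothetical helper."""
    g = torch.Generator().manual_seed(seed)
    base = evaluate(model, graph)

    rewired = copy.deepcopy(graph)  # destroy inter-fragment relations
    perm = torch.randperm(rewired.edge_index.size(1), generator=g)
    rewired.edge_index[1] = rewired.edge_index[1][perm]  # shuffle edge targets

    shuffled = copy.deepcopy(graph)  # break intra-fragment feature alignment
    shuffled.x = shuffled.x[torch.randperm(shuffled.x.size(0), generator=g)]

    return {"baseline": base,
            "edges_rewired": evaluate(model, rewired),
            "features_shuffled": evaluate(model, shuffled)}
```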
- Referee: [§3.1] Campaign selection: the manuscript states that the 24 campaigns are drawn from real-world cyber-incident trails but provides no explicit criteria for inclusion, diversity metrics, or validation that the matched benign sessions form an unbiased distribution. This directly affects the weakest assumption underlying the central claim that the benchmark is representative.
Authors: We acknowledge the need for greater transparency on benchmark construction. In the revised version we will specify the inclusion criteria (sourcing from public incident reports with requirements for multi-fragment kill chains), report diversity metrics (distribution over attack categories, temporal span, and target platforms), and detail the benign-session matching procedure together with validation steps (e.g., Kolmogorov-Smirnov tests on session-length and behavioral features) to support the claim of an unbiased distribution. Any remaining limitations in representativeness will be discussed explicitly. revision: yes
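A sketch of the proposed matching validation, assuming simple per-session covariates such as session_length (illustrative, not FragBench's feature set). A large p-value only supports matched marginals on a feature; it does not prove the benign set is unbiased overall.

```python
from scipy.stats import ks_2samp

def matching_report(benign_sessions, malicious_sessions,
                    features=("session_length",)):
    """Two-sample KS test per covariate; sessions are dicts of features
    (assumed schema)."""
    report = {}
    for feat in features:
        a = [s[feat] for s in benign_sessions]
        b = [s[feat] for s in malicious_sessions]
        stat, p = ks_2samp(a, b)
        report[feat] = {"ks_stat": stat, "p_value": p}
    return report
```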
Circularity Check
No circularity: empirical benchmark evaluation on held-out data.
full rationale
The paper constructs FragBench from 24 real-world campaigns, uses its own rewriter to produce fragments that evade single-turn judges by design, and reports empirical F1 scores (0.88-0.96) for GNN and classical-ML detectors on the resulting cross-session graphs. This is a direct measurement on a held-out test set rather than any derivation, equation, or prediction that reduces to fitted inputs or self-citations by construction. No load-bearing steps rely on self-citation for uniqueness, smuggle ansatzes, or rename known results; the 'by construction' clause applies only to the judge's per-fragment performance and does not force the cross-session detection outcome. The evaluation is falsifiable against external data and remains self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Fragments from the 24 campaigns can be rewritten to individually evade a single-turn safety judge while preserving the overall malicious goal when combined.
- domain assumption: The interaction graph constructed from session fragments contains detectable cross-session features that distinguish malicious from benign trails.