pith. machine review for the scientific record.

arxiv: 2605.11029 · v1 · submitted 2026-05-10 · 💻 cs.CR · cs.AI

Recognition: no theorem link

FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments

Arthur Collé, Astha Mehta, Cedric Lam, David Williams-King, Hengxu Li, Jamie Johnson, Linh Le, My (Isabella) Luong, Niruthiha Selvanayagam, Olivia McGoffin, Phuc-Nguyen Nguyen, Raymond Lee

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:09 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords FragBench · LLM safety · cross-session attacks · fragmented prompts · graph neural networks · adversarial benchmarking · cybersecurity · interaction graphs

The pith

Graph models recover malicious signals in LLM fragments spread across separate sessions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds FragBench from 24 real cyber-incident campaigns to test attacks that split a harmful goal into sub-prompts, each of which looks benign to a single-turn safety judge. It creates paired tasks: an adversarial rewriter that hardens the fragments against isolated checks, and a graph-based detector trained on the cross-session interaction trails that the fragments leave behind. Four GNN variants plus three classical ML baselines all identify the underlying attack pattern at event-level F1 scores of 0.88 to 0.96, even though the single-turn judge scores near chance. A reader would care because current safety evaluations ignore multi-session context, leaving a gap that real attackers can exploit by distributing their intent over time.

Core claim

FragBench keeps the full attack trail from real campaigns, including the multi-fragment kill chain, per-fragment safety verdicts, sandbox execution traces, and matched benign sessions. The single-turn judge is defeated by construction, yet modeling the interaction graph among fragments from the same user allows both GNNs and classical ML methods to recover the cross-session malicious feature at high accuracy. This establishes that defense against fragmented LLM misuse requires graph-based reasoning over user-level interactions rather than evaluation of isolated prompts.

What carries the argument

The cross-session interaction graph, which links fragments from the same campaign into a single user trail so that collective malicious structure becomes visible even when no individual node triggers a safety flag.
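A minimal sketch of how such a trail might be assembled (illustrative only: the fragment texts, field names, and consecutive-session chaining rule here are invented for the example, not FragBench's actual schema). The point is that per-user grouping turns individually benign nodes into a graph whose collective structure a detector can score:

```python
from collections import defaultdict

# Hypothetical fragment records: each looks benign in isolation.
fragments = [
    {"user": "u1", "session": "s1", "text": "list open ports on a host"},
    {"user": "u1", "session": "s2", "text": "write a script that tries passwords"},
    {"user": "u1", "session": "s3", "text": "archive files and upload them"},
    {"user": "u2", "session": "s4", "text": "summarise this article"},
]

def build_user_graphs(fragments):
    """Group fragments per user and chain them into a session trail.

    Nodes are fragments; edges link consecutive sessions of the same
    user, so cross-session structure becomes a graph-level feature.
    """
    graphs = defaultdict(lambda: {"nodes": [], "edges": []})
    for frag in fragments:
        g = graphs[frag["user"]]
        g["nodes"].append(frag)
        if len(g["nodes"]) > 1:  # edge from previous fragment to this one
            g["edges"].append((len(g["nodes"]) - 2, len(g["nodes"]) - 1))
    return dict(graphs)

graphs = build_user_graphs(fragments)
```

Here u1's three benign-looking fragments form one connected trail (3 nodes, 2 edges), while u2's single session stays isolated; any downstream detector would score the trail, not the nodes.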

If this is right

  • Single-turn safety judges are insufficient by design for detecting split malicious goals.
  • Graph-based detectors on user interaction data achieve F1 scores of 0.88-0.96 where isolated checks fail.
  • Effective defense requires shifting evaluation from individual prompts to multi-session interaction graphs.
  • The released rewriter, sandbox harness, and detector code enable systematic testing of new graph defenses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production LLM systems may need persistent user-session graphs to catch attacks that unfold over hours or days.
  • The same interaction-graph approach could be tested on other covert multi-step misuse patterns beyond cyber campaigns.
  • Adding execution-trace features to the graph nodes might allow earlier blocking before the full kill chain completes.

Load-bearing premise

Fragments from real campaigns can be rewritten to each pass a single-turn safety judge on its own while the combination across sessions remains malicious, and the chosen 24 campaigns plus benign matches form a representative test distribution.

What would settle it

Run the released GNN and classical detectors on a fresh collection of fragmented attack campaigns drawn from sources outside the original 24. If aggregate event-level F1 falls below 0.7 on that collection, the claim that cross-session graph modeling reliably recovers the hidden malicious signal is falsified.
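That check can be sketched concretely, assuming a standard F1 computed over binary event-level predictions (the paper's exact event definition is not reproduced here, and the 0.7 threshold is the editorial criterion above, not the authors'):

```python
def event_f1(y_true, y_pred):
    """Event-level F1: harmonic mean of precision and recall over
    binary per-event predictions (1 = malicious, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def claim_falsified(y_true, y_pred, threshold=0.7):
    """The editorial criterion: the claim fails if aggregate
    event-level F1 on fresh campaigns drops below the threshold."""
    return event_f1(y_true, y_pred) < threshold
```

For example, three of three true events caught with one miss and no false alarms gives F1 = 0.8, above the 0.7 bar.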

Figures

Figures reproduced from arXiv: 2605.11029 by Arthur Collé, Astha Mehta, Cedric Lam, David Williams-King, Hengxu Li, Jamie Johnson, Linh Le, My (Isabella) Luong, Niruthiha Selvanayagam, Olivia McGoffin, Phuc-Nguyen Nguyen, Raymond Lee.

Figure 1: Overview of FragBench. Single-turn guardrails miss malicious composition spread across …

Figure 2: Examples of the six rewriting styles used to disguise attack fragments.

Figure 3: MCP harness architecture. A central mcp-client ingests campaign JSON files, selects the attack based on the style and seed, and routes tool calls to a fleet of 24 specialised MCP servers via FastMCP (HTTP + SSE). All agent interactions are captured as structured .jsonl session logs served through a viewer at port 8787. Support is available for self-hosted LLMs as well as those accessed via openrouter.

Figure 4: MCP validation frontend views. The Live view streams fragment execution in real time, …
original abstract

An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by construction, but four GNN variants and three classical-ML baselines all recover the cross-session feature, reaching aggregate event-level F1 = 0.88-0.96. Defending against fragmented LLM misuse therefore requires modeling the cross-session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at https://github.com/LidaSafety/fragbench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FragBench, a benchmark constructed from 24 real-world cyber-incident campaigns that preserves full attack trails including multi-fragment kill chains, per-fragment safety-judge verdicts, sandboxed execution traces, and matched benign sessions. It defines two paired tasks: FragBench Attack, an adversarial rewriter that hardens individual fragments to evade a single-turn safety judge while keeping the combination malicious, and FragBench Defense, a graph-based detector (four GNN variants plus three classical-ML baselines) trained on the resulting cross-session interaction graphs. The single-turn judge performs near chance by construction on the released corpus, yet the detectors recover the cross-session malicious signal and achieve aggregate event-level F1 scores of 0.88-0.96. The work concludes that defending against fragmented LLM misuse requires modeling the cross-session interaction graph rather than isolated prompts, and releases the generator, rewriter, sandbox harness, and detector.

Significance. If the benchmark construction avoids systematic artifacts and the high F1 scores reflect genuine cross-session features rather than rewriter-induced patterns, the result is significant: it identifies a previously unaddressed threat model in LLM safety (multi-session fragmentation with no shared context) and demonstrates that graph-based models can detect it where single-turn or single-session approaches fail. The explicit release of code, rewriter, and harness is a clear strength that enables reproducibility and follow-on work. The empirical demonstration across multiple model families strengthens the case that cross-session modeling is necessary.

major comments (3)
  1. [§4, §5.1] §4 (experimental setup) and §5.1 (results): the reported aggregate event-level F1 range of 0.88-0.96 is presented without error bars, statistical significance tests, or explicit description of how the 24 campaigns were partitioned into train/validation/test sets. This makes it impossible to determine whether the cross-session recovery generalizes or reflects campaign-specific selection effects.
  2. [§3.2] §3.2 (FragBench Attack rewriter): no ablation is reported on rewriter variants, alternative single-turn judges, or feature-importance analysis that isolates inter-fragment relational signals from intra-fragment lexical or structural artifacts introduced by the rewriter itself. Without such controls, it remains possible that both GNNs and classical baselines exploit consistent rewriting patterns rather than a general cross-session malicious feature.
  3. [§3.1] §3.1 (campaign selection): the manuscript states that the 24 campaigns are drawn from real-world cyber-incident trails but provides no explicit criteria for inclusion, diversity metrics, or validation that the matched benign sessions form an unbiased distribution. This directly affects the weakest assumption underlying the central claim that the benchmark is representative.
minor comments (2)
  1. [abstract, §5] The abstract and §5 refer to 'event-level F1' without a precise definition or pseudocode for how events are aggregated from fragment-level predictions.
  2. [§5] Table captions and axis labels in the results figures should explicitly state the number of campaigns and sessions used for each reported metric.
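The first minor comment can be made concrete. One plausible aggregation rule (an assumption for illustration, not the paper's definition) marks an event as malicious if any fragment in its trail is flagged, then computes F1 over events rather than fragments:

```python
def aggregate_to_events(fragment_preds):
    """Collapse fragment-level predictions to one prediction per event.

    fragment_preds: dict mapping event_id -> list of 0/1 fragment flags.
    An event is predicted malicious if any of its fragments is flagged
    (one plausible rule; the paper would need to state its own).
    """
    return {eid: int(any(flags)) for eid, flags in fragment_preds.items()}

preds = aggregate_to_events({
    "campaign_a": [0, 1, 0],   # one flagged fragment -> malicious event
    "benign_b":   [0, 0, 0],   # nothing flagged -> benign event
})
```

Other rules (majority vote, a threshold on the fraction flagged) would yield different event-level scores, which is exactly why the referee asks for the definition in pseudocode.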

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, agreeing where revisions are needed to improve clarity and rigor, and commit to incorporating these changes in the revised manuscript.

point-by-point responses
  1. Referee: [§4, §5.1] §4 (experimental setup) and §5.1 (results): the reported aggregate event-level F1 range of 0.88-0.96 is presented without error bars, statistical significance tests, or explicit description of how the 24 campaigns were partitioned into train/validation/test sets. This makes it impossible to determine whether the cross-session recovery generalizes or reflects campaign-specific selection effects.

    Authors: We agree that these details are essential for evaluating generalization and were insufficiently reported. In the revised manuscript we will add error bars computed via bootstrapping over campaigns and 5-fold cross-validation, include statistical significance tests (e.g., McNemar’s test and paired t-tests) comparing GNN variants against classical baselines, and explicitly describe the partitioning (campaign-level splits with no session leakage across folds). We will also report per-campaign F1 scores to demonstrate consistency rather than aggregate-only results. revision: yes

  2. Referee: [§3.2] §3.2 (FragBench Attack rewriter): no ablation is reported on rewriter variants, alternative single-turn judges, or feature-importance analysis that isolates inter-fragment relational signals from intra-fragment lexical or structural artifacts introduced by the rewriter itself. Without such controls, it remains possible that both GNNs and classical baselines exploit consistent rewriting patterns rather than a general cross-session malicious feature.

    Authors: This is a valid concern about potential confounding. We will add the requested controls in the revision: ablations across rewriter variants (different adversarial objectives and prompt templates), evaluation using alternative single-turn safety judges, and feature-importance analysis (GNN attention weights and permutation importance) that quantifies the contribution of inter-fragment edges versus intra-fragment node features. These additions will help isolate the cross-session relational signal from rewriter-induced artifacts. revision: yes

  3. Referee: [§3.1] §3.1 (campaign selection): the manuscript states that the 24 campaigns are drawn from real-world cyber-incident trails but provides no explicit criteria for inclusion, diversity metrics, or validation that the matched benign sessions form an unbiased distribution. This directly affects the weakest assumption underlying the central claim that the benchmark is representative.

    Authors: We acknowledge the need for greater transparency on benchmark construction. In the revised version we will specify the inclusion criteria (sourcing from public incident reports with requirements for multi-fragment kill chains), report diversity metrics (distribution over attack categories, temporal span, and target platforms), and detail the benign-session matching procedure together with validation steps (e.g., Kolmogorov-Smirnov tests on session-length and behavioral features) to support the claim of an unbiased distribution. Any remaining limitations in representativeness will be discussed explicitly. revision: yes
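The campaign-level partitioning promised in response 1 can be sketched as a generic grouped split (not the authors' released code; session and campaign identifiers here are hypothetical). The invariant is that every session of a given campaign lands in the same fold, so no campaign leaks across the train/test boundary:

```python
import random

def campaign_level_folds(sessions, n_folds=5, seed=0):
    """Partition sessions into folds by campaign, so that all sessions
    of one campaign stay in a single fold (no cross-fold leakage).

    sessions: list of (session_id, campaign_id) pairs.
    Returns n_folds lists of session ids.
    """
    campaigns = sorted({c for _, c in sessions})
    rng = random.Random(seed)
    rng.shuffle(campaigns)                       # randomise fold assignment
    fold_of = {c: i % n_folds for i, c in enumerate(campaigns)}
    folds = [[] for _ in range(n_folds)]
    for sid, c in sessions:
        folds[fold_of[c]].append(sid)
    return folds

folds = campaign_level_folds(
    [("s1", "A"), ("s2", "A"), ("s3", "B"),
     ("s4", "C"), ("s5", "D"), ("s6", "D")],
    n_folds=2,
)
```

Bootstrapping over campaigns (resampling campaign ids with replacement and recomputing F1) would then give the error bars the referee requests.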
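The Kolmogorov-Smirnov validation proposed in response 3 reduces to comparing empirical CDFs of some session feature (length, tool-call counts) between malicious and matched benign trails. A minimal pure-Python version of the two-sample statistic, as a sketch of that check:

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)

    def ecdf(sample, x):
        # fraction of sample values <= x
        return sum(1 for v in sample if v <= x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)
```

Identical distributions give a statistic of 0 and fully disjoint ones give 1; a small value on session-length features would support the claim that benign matches are unbiased. (In practice scipy.stats.ks_2samp also supplies the p-value the rebuttal implies.)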

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation on held-out data.

full rationale

The paper constructs FragBench from 24 real-world campaigns, uses its own rewriter to produce fragments that evade single-turn judges by design, and reports empirical F1 scores (0.88-0.96) for GNN and classical-ML detectors on the resulting cross-session graphs. This is a direct measurement on a held-out test set rather than any derivation, equation, or prediction that reduces to fitted inputs or self-citations by construction. No load-bearing steps rely on self-citation for uniqueness, smuggle ansatzes, or rename known results; the 'by construction' clause applies only to the judge's per-fragment performance and does not force the cross-session detection outcome. The evaluation is falsifiable against external data and remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that real-world attack campaigns can be decomposed into individually benign fragments whose malicious intent only emerges across sessions, plus standard supervised-learning assumptions for the GNN and baseline models. No new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption Fragments from the 24 campaigns can be rewritten to individually evade a single-turn safety judge while preserving the overall malicious goal when combined.
    Invoked in the construction of the FragBench Attack task and the statement that the single-turn judge is near chance by construction.
  • domain assumption The interaction graph constructed from session fragments contains detectable cross-session features that distinguish malicious from benign trails.
    Central to the FragBench Defense task and the claim that GNNs recover the feature.

pith-pipeline@v0.9.0 · 5578 in / 1528 out tokens · 51258 ms · 2026-05-13T01:09:56.079147+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages

  1. M. Andriushchenko et al. AgentHarm: A benchmark for measuring harmfulness of LLM agents, 2024.
  2. C. Anil et al. Many-shot jailbreaking. Technical report, Anthropic, 2024.
  3. Anthropic. Model Context Protocol. https://modelcontextprotocol.io/, 2024.
  4. Anthropic. Detecting and countering malicious uses of Claude: August 2025. Technical report, Anthropic, Aug. 2025.
  5. Anthropic. Disrupting the first reported AI-orchestrated cyber espionage campaign. Technical report, Anthropic, Nov. 2025.
  6. Anthropic. System Card: Claude Mythos Preview. https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf, 2026.
  7. M. Bhatt et al. Purple Llama CyberSecEval: A benchmark for evaluating the cybersecurity risks of large language models. Technical report, Meta AI Research, 2024.
  8. M. Bhatt et al. CyberSecEval 2: A wide-ranging cybersecurity evaluation suite for large language models, 2024.
  9. P. Chao, E. Debenedetti, A. Robey, M. Andriushchenko, F. Croce, V. Sehwag, E. Dobriban, N. Flammarion, G. J. Pappas, F. Tramèr, H. Hassani, and E. Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models, 2024.
  10. P. Chao et al. Jailbreaking black box large language models in twenty queries, 2023.
  11. C. Eksombatchai et al. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In Proceedings of WWW, 2018.
  12. R. Fang et al. Teams of LLM agents can exploit zero-day vulnerabilities, 2024.
  13. J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
  14. D. Glukhov et al. LLM censorship: A machine learning challenge or a computer security problem?, 2023.
  15. Google Threat Intelligence Group. GTIG AI threat tracker: Advances in threat-actor usage of AI tools. Technical report, Google Threat Intelligence Group, Nov. 2025.
  16. Google Threat Intelligence Group. GTIG AI threat tracker: Distillation, experimentation, and (continued) integration of AI for adversarial use. Technical report, Google Threat Intelligence Group, Feb. 2026.
  17. K. Guo et al. RedCode: Risky code execution and generation benchmark for code agents. In Proceedings of NeurIPS, 2024.
  18. W. L. Hamilton et al. Inductive representation learning on large graphs. In Proceedings of NeurIPS, 2017.
  19. D. Kang et al. Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. In Proceedings of IEEE SaTML, 2024.
  20. T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR, 2017.
  21. A. Kumarappan and A. Mujoo. Automating deception: Scalable multi-turn LLM jailbreaks. In NeurIPS 2025 Workshop on Multi-Turn Interactions in Large Language Models, 2025.
  22. X. Li et al. DrAttack: Prompt decomposition and reconstruction makes powerful LLMs jailbreakers. In Findings of EMNLP, 2024.
  23. W. W. Lo et al. Inspection-L: Self-supervised GNN node embeddings for money laundering detection in Bitcoin. Applied Intelligence, 2023.
  24. M. Mazeika et al. HarmBench: A standardised evaluation framework for automated red teaming and robust refusal, 2024.
  25. A. Mehrotra et al. Tree of attacks: Jailbreaking black-box LLMs automatically, 2023.
  26. Meta Llama. Purple Llama CyberSecEval: A Secure Coding Benchmark for Large Language Models. https://meta-llama.github.io/PurpleLlama/CyberSecEval/, 2025.
  27. Microsoft. Microsoft digital defense report 2025. Technical report, Microsoft, Oct. 2025.
  28. Microsoft Threat Intelligence. Threat actor abuse of AI accelerates from tool to cyberattack surface. Microsoft Security Blog, Apr. 2026.
  29. Microsoft Threat Intelligence. AI as tradecraft: How threat actors operationalize AI. Microsoft Security Blog, Mar. 2026.
  30. OpenAI. Disrupting malicious uses of AI: June 2025. Technical report, OpenAI, June 2025.
  31. OpenAI. Disrupting malicious uses of AI: October 2025. Technical report, OpenAI, Oct. 2025.
  32. OpenAI. Disrupting malicious uses of AI: February 2026. Technical report, OpenAI, Feb. 2026.
  33. Palo Alto Networks Unit 42. The dual-use dilemma of AI: Malicious LLMs. Unit 42, Nov. 2025.
  34. reinthal. Red-APT. https://github.com/reinthal/Red-APT, 2026. GitHub repository, accessed March 7, 2026.
  35. M. Russinovich. Mitigating Skeleton Key, a new type of generative AI jailbreak technique. Microsoft Security Blog, June 2024.
  36. M. Russinovich et al. Great, now write an article about that: The Crescendo multi-turn LLM jailbreak attack, 2024.
  37. M. Schlichtkrull et al. Modeling relational data with graph convolutional networks. In Proceedings of ESWC, 2018.
  38. SentinelLABS. Prompts as code & embedded keys: The hunt for LLM-enabled malware. SentinelLABS, Sept. 2025.
  39. UC Berkeley RDI. CyberGym: Evaluating AI agents' real-world cybersecurity capabilities. Technical report, UC Berkeley Research Center for Decentralized Intelligence, 2025.
  40. P. Veličković et al. Graph attention networks. In Proceedings of ICLR, 2018.
  41. A. Wan et al. CyberSecEval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models. Technical report, Meta AI Research, 2024.
  42. M. Weber et al. Anti-money laundering in Bitcoin: Experimenting with graph convolutional networks for financial forensics. In KDD AMLD Workshop, 2019.
  43. Z. Weng, X. Jin, J. Jia, and X. Zhang. Foot-in-the-door: A multi-turn jailbreak for LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1939–1950. Association for Computational Linguistics, 2025.
  44. K. Xu et al. How powerful are graph neural networks? In Proceedings of ICLR, 2019.
  45. K.-C. Yang et al. Scalable and generalizable social bot detection through data selection. In Proceedings of AAAI, 2020.
  46. Y. Yang et al. SneakyPrompt: Jailbreaking text-to-image generative models. In Proceedings of IEEE Security and Privacy, 2024.
  47. A. K. Zhang et al. Cybench: A framework for evaluating cybersecurity capabilities and risks of language models, 2024.
  48. Y. Zhang et al. CVE-Bench: A benchmark for AI agents' ability to exploit real-world web application vulnerabilities, 2025.
  49. A. Zou et al. Universal and transferable adversarial attacks on aligned language models, 2023.
  50. fragment_id