Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
Pith reviewed 2026-05-07 07:00 UTC · model grok-4.3
The pith
Multi-turn attacks produce excess activation path lengths in the LLM residual stream, and five trajectory features built on that signal detect attacks at 93.8 percent accuracy on synthetic held-out data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The attack path of trust-building, pivoting and escalation produces phase shifts in residual stream activations that yield substantially longer total path lengths than benign conversations. Five scalar trajectory features extracted from these paths enable conversation-level detection at 93.8 percent accuracy on synthetic held-out data, up from 76.2 percent. This adversarial restlessness signal replicates across four model families (24B-70B), yet probes are architecture-specific and do not transfer. Three-phase turn-level labels are required to keep false positives low; combined training on synthetic, LMSYS-Chat-1M and SafeDialBench sources yields 89.4 percent detection at a 2.4 percent false-positive rate on a held-out mixed set.
What carries the argument
Adversarial restlessness: the excess total path length in residual-stream activation trajectories caused by the phase shifts of trust-building, pivoting and escalation.
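A minimal sketch of the path-length quantity, assuming the definition given in the simulated rebuttal further down (the sum of Euclidean distances between consecutive residual-stream activations). The function name `total_path_length` and the toy 2-D vectors are hypothetical stand-ins; real activations would be high-dimensional and possibly averaged over layers and positions, which is not modeled here.

```python
import math
from typing import Sequence


def total_path_length(activations: Sequence[Sequence[float]]) -> float:
    """Sum of Euclidean distances between consecutive per-turn activation vectors."""
    return sum(math.dist(a, b) for a, b in zip(activations, activations[1:]))


# Toy 2-D vectors standing in for per-turn residual-stream activations (hypothetical data).
steady = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
restless = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0)]

print(total_path_length(steady))    # 2.0
print(total_path_length(restless))  # 10.0
```

The second trajectory ends where it started, so its net displacement is zero, yet its path length is five times larger. That gap between distance travelled and distance covered is the kind of signal the review calls restlessness.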
If this is right
- Detectors can flag attacks at the conversation level without needing to label any single turn as malicious in text.
- Probes must be trained separately for each model family rather than deployed as universal detectors.
- Three-phase turn labels are required during training; binary conversation labels alone produce unacceptable false-positive rates.
- Combining synthetic, real chat log and benchmark sources during training improves performance on held-out mixed distributions.
Where Pith is reading between the lines
- The method could be inserted into an inference stack to monitor live conversations by periodically sampling residual activations.
- Attacks engineered to avoid abrupt phase shifts or to mimic natural topic changes might shrink the path-length signal and reduce detection rates.
- The requirement for diverse training sources implies that maintaining performance will need continuous collection of attack examples from multiple environments.
Load-bearing premise
The observed longer activation paths and trajectory features are caused specifically by the attack phases rather than other conversation dynamics, and the synthetic attack distributions represent real-world attacks well enough for the method to generalize.
What would settle it
A large set of benign multi-turn conversations that contain multiple topic shifts or long interactions but no adversarial intent, measured for activation path lengths, would falsify the claim if their lengths routinely exceed the attack threshold.
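The falsification test above reduces to a simple rate: what fraction of benign, topic-shifting conversations have a path length above the attack threshold. The sketch below assumes path lengths have already been computed; `exceedance_rate`, the sample values, and the threshold are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def exceedance_rate(path_lengths: Sequence[float], threshold: float) -> float:
    """Fraction of conversations whose path length is strictly above the threshold."""
    if not path_lengths:
        raise ValueError("need at least one conversation")
    return sum(1 for p in path_lengths if p > threshold) / len(path_lengths)


# Hypothetical path lengths for benign conversations with several topic shifts.
benign_topic_shift_lengths = [4.1, 9.7, 12.3, 5.0]
attack_threshold = 10.0  # hypothetical cutoff, not a value reported in the paper

print(exceedance_rate(benign_topic_shift_lengths, attack_threshold))  # 0.25
```

A rate near zero on a large benign set would support the claim; a rate that is routinely high would undercut it.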
Original abstract
Multi-turn prompt injection follows a known attack path (trust-building, pivoting, escalation), but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows each of synthetic, LMSYS-Chat-1M, and SafeDialBench captures distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels (benign/pivoting/adversarial), unique to our synthetic dataset, are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.
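The leave-one-source-out protocol and the two headline metrics (detection rate and false positive rate) can be illustrated with a short sketch. It assumes conversations are grouped by source name and that labels and flags are 0/1 per conversation; `leave_one_source_out`, `detection_and_fpr`, and the toy identifiers are hypothetical and do not come from the paper's code.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_source_out(
    by_source: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_source, train_ids, test_ids), holding out one source per fold."""
    for held_out in by_source:
        train = [c for s, convs in by_source.items() if s != held_out for c in convs]
        yield held_out, train, list(by_source[held_out])


def detection_and_fpr(labels: Sequence[int], flags: Sequence[int]) -> Tuple[float, float]:
    """Detection rate over adversarial (label 1) and false positive rate over benign (label 0)."""
    pos = [f for y, f in zip(labels, flags) if y == 1]
    neg = [f for y, f in zip(labels, flags) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one adversarial and one benign conversation")
    return sum(pos) / len(pos), sum(neg) / len(neg)


# Toy conversation ids per source (hypothetical).
sources = {"synthetic": ["s1", "s2"], "lmsys": ["l1"], "safedialbench": ["d1"]}
for held_out, train_ids, test_ids in leave_one_source_out(sources):
    print(held_out, train_ids, test_ids)

print(detection_and_fpr([1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0]))  # (0.5, 0.25)
```

Reporting both numbers per fold matters here because, as the label ablation in the abstract shows, a probe can reach high detection while flagging half of all benign conversations.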
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multi-turn prompt injection attacks follow a three-phase path (trust-building, pivoting, escalation) that produces a detectable 'adversarial restlessness' signature in LLM residual-stream activations, quantified by total path length and five scalar trajectory features. These features improve conversation-level detection from 76.2% to 93.8% on synthetic held-out data, replicate across four model families (24B-70B), require model-specific probes, and show source-dependent generalization (47-71% on LMSYS when represented in training); combined three-source training reaches 89.4% detection at 2.4% FPR. Three-phase turn-level labels are shown to be essential, as binary labels yield 50-59% false positives.
Significance. If the trajectory features specifically encode attack-phase shifts rather than generic multi-turn dynamics, the work offers a new activation-level defense against covert multi-turn attacks that evade text-based detectors. Strengths include the cross-model replication, the explicit demonstration that three-phase labeling is required, and the characterization of data-source requirements for practical deployment. The concrete performance numbers and held-out mixed-set result provide a clear benchmark for future activation-based detectors.
major comments (2)
- [Abstract and experimental results] The central claim that the five scalar trajectory features and total path length specifically capture adversarial phase shifts (rather than generic conversation length or turn count) is load-bearing for the 76.2%→93.8% lift and the 'adversarial restlessness' interpretation. The manuscript reports no ablation that matches adversarial and benign conversations on turn count, total tokens, or topic complexity; without this control, the observed activation displacements could be confounded by longer trajectories in multi-turn data regardless of attack content.
- [Methods and results sections] The abstract refers to 'five scalar trajectory features' and 'probe classifiers' but provides no explicit definitions of the features (e.g., how path length is computed across residual-stream positions, which scalars are extracted), classifier architecture, training procedure, error bars, or exact train/test splits. These omissions make it impossible to verify the reported numbers or assess whether the model-specific probes are reproducible.
minor comments (2)
- [Abstract] The term 'adversarial restlessness' is introduced in the abstract without a concise definition; a one-sentence gloss would improve readability for readers unfamiliar with activation trajectories.
- [Results] The paper would benefit from a figure showing example activation trajectories (or path-length histograms) for benign versus adversarial conversations to visually support the 'far exceeds' claim.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on our work. The comments highlight key areas for strengthening the interpretation of our results and improving reproducibility. We address each major comment point by point below and describe the revisions we will make.
- Referee: [Abstract and experimental results] The central claim that the five scalar trajectory features and total path length specifically capture adversarial phase shifts (rather than generic conversation length or turn count) is load-bearing for the 76.2%→93.8% lift and the 'adversarial restlessness' interpretation. The manuscript reports no ablation that matches adversarial and benign conversations on turn count, total tokens, or topic complexity; without this control, the observed activation displacements could be confounded by longer trajectories in multi-turn data regardless of attack content.
Authors: We agree that an explicit control for conversation length, token count, and topic complexity is necessary to isolate whether the trajectory features encode phase shifts rather than generic multi-turn dynamics. Our synthetic dataset was constructed with matched turn counts and structures between benign and adversarial conversations, and the three-phase labeling ablation (showing 50-59% false positives under binary labels) provides indirect evidence that the signal is phase-dependent. However, we did not report a post-hoc matched subsample analysis. In the revised manuscript we will add a dedicated ablation subsection that subsamples the held-out synthetic set to enforce exact matching on turn count and total tokens (while preserving topic distribution where possible) and re-evaluate the five scalar features plus path length under these controls. We will also discuss residual potential confounders such as topic complexity. revision: yes
- Referee: [Methods and results sections] The abstract refers to 'five scalar trajectory features' and 'probe classifiers' but provides no explicit definitions of the features (e.g., how path length is computed across residual-stream positions, which scalars are extracted), classifier architecture, training procedure, error bars, or exact train/test splits. These omissions make it impossible to verify the reported numbers or assess whether the model-specific probes are reproducible.
Authors: We acknowledge that the current manuscript does not supply sufficient implementation-level detail for independent verification. In the revised version we will expand the Methods section with: (i) the precise definition of total path length as the sum of Euclidean distances between consecutive residual-stream activations (averaged over selected layers and positions); (ii) the explicit formulas and extraction procedure for each of the five scalar trajectory features; (iii) the probe classifier architecture (including layer, hidden size, activation, and regularization); (iv) the full training procedure, optimizer, learning-rate schedule, and early-stopping criteria; (v) standard-error bars computed over five random seeds; and (vi) the exact train/validation/test split ratios, conversation counts per source, and random seeds used for all reported numbers. These additions will make the 93.8% accuracy, cross-model replication, and source-dependent generalization results fully reproducible. revision: yes
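The matched-subsample ablation promised in the first response could look roughly like the sketch below. It assumes each conversation carries a turn count, a total token count, and a 0/1 label, and it keeps equal numbers of benign and adversarial conversations in every exact (turns, tokens) stratum. The function `matched_subsample` and the record layout are hypothetical; exact matching on token count is strict, so a real analysis might bin tokens instead, which is not shown here.

```python
import random
from collections import defaultdict
from typing import Dict, List

Conversation = Dict[str, int]  # hypothetical record with keys "turns", "tokens", "label"


def matched_subsample(convs: List[Conversation], seed: int = 0) -> List[Conversation]:
    """Keep equal numbers of benign (0) and adversarial (1) conversations per (turns, tokens) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(lambda: {0: [], 1: []})
    for c in convs:
        strata[(c["turns"], c["tokens"])][c["label"]].append(c)

    matched: List[Conversation] = []
    for key in sorted(strata):
        benign, adversarial = strata[key][0], strata[key][1]
        k = min(len(benign), len(adversarial))  # strata with only one class contribute nothing
        matched += rng.sample(benign, k) + rng.sample(adversarial, k)
    return matched


# Toy held-out set (hypothetical values).
held_out = [
    {"turns": 5, "tokens": 400, "label": 0},
    {"turns": 5, "tokens": 400, "label": 0},
    {"turns": 5, "tokens": 400, "label": 1},
    {"turns": 8, "tokens": 900, "label": 1},  # no benign partner, so it is dropped
]
print(matched_subsample(held_out))
```

If the 76.2% to 93.8% lift survives on such a subsample, conversation length and token count are ruled out as the source of the signal; topic complexity would still need a separate control.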
Circularity Check
No significant circularity in derivation chain
The paper computes total path length and five scalar trajectory features directly from differences in residual-stream activations across conversation turns. These quantities are defined and extracted without reference to attack labels or downstream classifier outputs. Classifiers are trained on the resulting features using externally labeled datasets (synthetic three-phase labels, LMSYS-Chat-1M, SafeDialBench) and evaluated on held-out splits and cross-source transfers. The reported lift (76.2% to 93.8%) and source-dependent generalization numbers are measured on these held-out sets. No equation or definition equates a claimed prediction to its own fitting procedure, no load-bearing result is justified solely by self-citation, and no ansatz or uniqueness claim is smuggled in. The work is therefore a standard supervised feature-based detector whose central empirical claims remain independent of the training process itself.
Axiom & Free-Parameter Ledger
free parameters (2)
- five scalar trajectory features
- probe classifiers
axioms (2)
- domain assumption: Activation shifts in the residual stream correspond to semantic phase changes in the conversation.
- domain assumption: The total path length in activation space exceeds that of benign conversations for attacks.
invented entities (1)
- adversarial restlessness: no independent evidence
Reference graph
Works this paper leans on
- [1] Measuring AI agents' progress on multi-step cyber attack scenarios. arXiv preprint arXiv:2603.11214. Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeff Wu. 2024. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093. Nicholas Goldowsky-Dill, Bilal Chughtai, and Stef...
- [2] Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread. Zonghao Ying, Deyue Zhang, Zhong Jing, Yisong Xiao, Qingchuan Zou, Aishan Liu, and Siyuan Liang. 2025. Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models. In Findings of EMNLP. Xiaoyu Zhang, Zhiyuan Zhao, ...
- [3] classifies the 128-dim embedding concatenated with 5 trajectory scalars (133 features). θ = 0.5 default cutoff, no threshold tuning on held-out data. Stage 4: Inference. At each user turn, extract activation, encode via frozen MLP, compute trajectory scalars, classify via XGBoost. Flag conversation if any turn exceeds θ. Hardware. Activation extraction: N...
- [4] Activation Hook: A forward hook on the target model's decoder layer extracts the residual stream hidden state at each user turn boundary (∼100 ms overhead per turn). Activations are ...

  Table 11: Label ablation on synthetic data (ranges across 4 models). Binary conversation-level labels produce a degen...

  Labels       Det.     FP
  Three-phase  96–98%   0.5–2%
  Binary       100%     50–59%
- [5] Streaming Probe: The XGBoost classifier evaluates each turn in real time, computing P_adv(t) from the activation and trajectory scalars. Conversations exceeding θ are flagged for review. The probe runs on CPU alongside the GPU inference pipeline.
- [6] Review and Labeling: Flagged conversations are routed for labeling. This can be automated via an LLM judge (e.g., an ensemble of a text-level prompt classifier and the activation probe), with human operators reviewing disagreements. Corrected labels feed back into retraining. Hybrid human-LLM review reduces the labeling bottleneck while maintaining labe...
- [7] Retraining Pipeline: Periodically (e.g., daily or weekly), the probe retrains on the original training data plus all newly labeled production conversations. Retraining requires no GPU, only cached activations and the XGBoost fit (<30 s on CPU for 20,000+ turns). N.2 The Adaptation Loop. The key insight enabling continual adaptation is the separation of activa...
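The inference rule quoted in entries [3] and [5] (build a 133-feature vector per user turn, score it, and flag the conversation if any turn exceeds θ) can be sketched as follows. This is an illustration under assumptions: `build_features` and `flag_conversation` are hypothetical names, and `score_fn` stands in for the trained XGBoost probe, which is not reproduced here.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

THETA = 0.5  # default cutoff quoted in the excerpt above


def build_features(embedding: Sequence[float], trajectory_scalars: Sequence[float]) -> List[float]:
    """Concatenate the 128-dim embedding with the 5 trajectory scalars (133 features)."""
    if len(embedding) != 128 or len(trajectory_scalars) != 5:
        raise ValueError("expected a 128-dim embedding and 5 trajectory scalars")
    return list(embedding) + list(trajectory_scalars)


def flag_conversation(
    turn_features: Iterable[Sequence[float]],
    score_fn: Callable[[Sequence[float]], float],
    theta: float = THETA,
) -> Tuple[bool, int]:
    """Return (flagged, index of the first turn scoring above theta, or -1 if none)."""
    for i, feats in enumerate(turn_features):
        if score_fn(feats) > theta:
            return True, i
    return False, -1


# Toy scorer: reads the last trajectory scalar as a probability (hypothetical stand-in
# for the probe's adversarial score).
toy_score = lambda feats: feats[-1]

turns = [
    build_features([0.0] * 128, [0.0, 0.0, 0.0, 0.0, 0.1]),
    build_features([0.0] * 128, [0.0, 0.0, 0.0, 0.0, 0.7]),
]
print(flag_conversation(turns, toy_score))  # (True, 1)
```

Returning the index of the first offending turn is a design choice for the review step: it tells a human or LLM judge where in the conversation to start reading.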