Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Prashant Kulkarni

arxiv: 2604.28129 · v1 · submitted 2026-04-30 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-07 07:00 UTC · model grok-4.3

keywords multi-turn prompt injection · adversarial detection · LLM activations · residual stream · trajectory features · adversarial restlessness · activation paths

The pith

Multi-turn attacks produce excess activation path length in the LLM residual stream, and five trajectory features built on it lift conversation-level detection to 93.8 percent on synthetic held-out data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-turn prompt injection follows phases of trust-building, pivoting and escalation that leave individual turns looking benign, so text-level checks fail to catch it. These phases create a measurable signature in the model's residual stream activations: each shift moves the activation vector, resulting in a total path length far greater than in ordinary conversations. The paper names this effect adversarial restlessness and extracts five scalar trajectory features from the activation paths. When added to a detector, the features raise conversation-level detection from 76.2 percent to 93.8 percent on held-out synthetic data. The same pattern appears in four model families spanning 24B to 70B parameters, though probes trained on one architecture do not transfer to others. Training on a mix of synthetic data, real chat logs and benchmark sources reaches 89.4 percent detection at a 2.4 percent false-positive rate on mixed held-out data, while binary labels alone produce 50-59 percent false positives.

Core claim

The attack path of trust-building, pivoting and escalation produces phase shifts in residual stream activations that yield substantially longer total path lengths than benign conversations. Five scalar trajectory features extracted from these paths lift conversation-level detection from 76.2 percent to 93.8 percent on synthetic held-out data. This adversarial restlessness signal replicates across four model families (24B-70B), yet probes are architecture-specific and do not transfer. Three-phase turn-level labels are required to keep false positives low; combined training on synthetic, LMSYS-Chat-1M and SafeDialBench sources yields 89.4 percent detection at a 2.4 percent false-positive rate on held-out mixed data.

What carries the argument

Adversarial restlessness: the excess total path length in residual-stream activation trajectories caused by the phase shifts of trust-building, pivoting and escalation.
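To make the path-length idea concrete, here is a minimal sketch that contrasts total path length with net displacement for a per-turn trajectory. It assumes one activation vector per turn and Euclidean distance between consecutive turns; the function names and the toy 2-D vectors are illustrative and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def total_path_length(trajectory):
    """Sum of distances between consecutive per-turn vectors."""
    return sum(
        euclidean(trajectory[i], trajectory[i + 1])
        for i in range(len(trajectory) - 1)
    )


def net_displacement(trajectory):
    """Straight-line distance from the first turn to the last."""
    return euclidean(trajectory[0], trajectory[-1])


# Toy 2-D stand-ins for residual-stream vectors (assumed data, not measurements).
steady = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
restless = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [3.0, 0.0]]

for name, traj in (("steady", steady), ("restless", restless)):
    print(name, total_path_length(traj), net_displacement(traj))
```

Both toy conversations end at the same point, but the second one wanders much farther to get there, which is the kind of excess the restlessness signal is described as measuring.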

If this is right

Detectors can flag attacks at the conversation level without needing to label any single turn as malicious in text.
Probes must be trained separately for each model family rather than deployed as universal detectors.
Three-phase turn labels are required during training; binary conversation labels alone produce unacceptable false-positive rates.
Combining synthetic, real chat log and benchmark sources during training improves performance on held-out mixed distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be inserted into an inference stack to monitor live conversations by periodically sampling residual activations.
Attacks engineered to avoid abrupt phase shifts or to mimic natural topic changes might shrink the path-length signal and reduce detection rates.
The requirement for diverse training sources implies that maintaining performance will need continuous collection of attack examples from multiple environments.

Load-bearing premise

The observed longer activation paths and trajectory features are caused specifically by the attack phases rather than other conversation dynamics, and the synthetic attack distributions represent real-world attacks well enough for the method to generalize.

What would settle it

A large set of benign multi-turn conversations that contain multiple topic shifts or long interactions but no adversarial intent, measured for activation path lengths, would falsify the claim if their lengths routinely exceed the attack threshold.
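One way such a check could be set up is to compare path lengths only between conversations with the same number of turns. The sketch below assumes each conversation has already been reduced to a `(label, n_turns, path_length)` tuple; the function name, the label strings and the toy numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def matched_path_length_gap(conversations):
    """Mean (attack - benign) path-length gap across turn counts.

    conversations: iterable of (label, n_turns, path_length) tuples,
    where label is "benign" or "attack". Only turn counts that contain
    both labels contribute. Returns None if no turn count is shared.
    """
    buckets = defaultdict(lambda: {"benign": [], "attack": []})
    for label, n_turns, path_length in conversations:
        buckets[n_turns][label].append(path_length)

    gaps = [
        mean(group["attack"]) - mean(group["benign"])
        for group in buckets.values()
        if group["attack"] and group["benign"]
    ]
    return mean(gaps) if gaps else None


# Toy data (assumed values): the 14-turn benign conversation has no
# attack counterpart, so it is excluded from the comparison.
toy = [
    ("benign", 6, 4.0),
    ("attack", 6, 9.0),
    ("benign", 10, 7.0),
    ("attack", 10, 12.0),
    ("benign", 14, 11.0),
]
print(matched_path_length_gap(toy))
```

A gap that stays clearly positive after matching would support the attack-specific reading; a gap near zero would point to conversation length as the driver.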

Figures

Figures reproduced from arXiv: 2604.28129 by Prashant Kulkarni.

**Figure 1.** The LAD two-stage pipeline. Stage 1: a contrastive MLP projects raw activations (d=5,120) into a 128-dim style-invariant embedding where same-intent turns cluster regardless of conversation style. Stage 2: XGBoost classifies the embedding concatenated with 5 trajectory scalars (133 features). Up to 89.4% detection at 2.4% FP on combined held-out (Qwen 2.5 32B, expanded 3-source training). Synthetic 42.9% L…

**Figure 3.** Turn-level label comparison. Left: Synthetic adversarial conversations exhibit gradual benign→pivoting→adversarial escalation. Center: LMSYS provides only binary labels (no pivoting phase). Right: First adversarial turn position: synthetic attacks pivot late (mean 81%), LMSYS attacks pivot early (mean 26%), providing complementary coverage.

**Figure 6.** Cross-model replication (scalar-augmented XGBoost, synthetic held-out, 797 conversations). Detection replicates at 89–96% across all four families with FP rates of 0.5–2.0%. [Adjacent chart residue: "Detection Rate (Combined Eval)", y-axis Detection Rate (%), series Standard XGBoost and Contrastive Probe, for Gemma 3 27B, Mistral 3.1 24B, Qwen 2.5 32B and Llama 3.1 70B.]

**Figure 5.** Extended pivoting: early detection improves 3–4× with longer pivoting phases. Left: Early detection rate rises monotonically with pivoting turns across all models. Center: Original vs extended comparison. Right: Mean lead time increases from +0.1–0.3 to +1.2–1.6 turns.

**Figure 8.** Feature ablation heatmap. Left: ∆ detection rate (pp) when each feature is removed. No single scalar dominates (<4pp), confirming a distributed trajectory signal. Right: Key modes: scalars alone detect but with catastrophic FP; activations provide precision.

**Figure 10.** Phase selectivity. Flag rate = flagged turns / total turns per phase. Selectivity S = flag rate(phase) / flag rate(benign); S ≫ 1 indicates selective intent detection. LAD: S_piv=14.9, S_adv=91.0. Lakera: S_piv=1.8, S_adv=2.3 (near-indiscriminate).
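The selectivity ratio in this caption is simple enough to work through by hand. The sketch below computes it from per-turn 0/1 flags; the function names and the toy flag counts are invented for illustration and are not the paper's data.

```python
def flag_rate(flags):
    """Fraction of turns flagged, given a list of 0/1 flags."""
    return sum(flags) / len(flags)


def selectivity(phase_flags, benign_flags):
    """Flag rate of a phase divided by the benign flag rate."""
    benign_rate = flag_rate(benign_flags)
    if benign_rate == 0:
        return float("inf")  # nothing benign was flagged
    return flag_rate(phase_flags) / benign_rate


# Toy counts (assumed): 2 of 100 benign turns and 30 of 100 pivoting
# turns are flagged, so selectivity is about 15.
benign_flags = [1] * 2 + [0] * 98
pivoting_flags = [1] * 30 + [0] * 70
print(selectivity(pivoting_flags, benign_flags))
```

A detector that flags every phase at the same rate would score S close to 1, which is what the caption means by near-indiscriminate.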

**Figure 11.** Trajectory traces on real-world data (LMSYS-Chat-1M, mixed training probe). Adversarial conversations show elevated drift; benign remain near zero across diverse models and topics.

**Figure 12.** Stage 3 training architecture. Stage 3a trains a contrastive MLP on 50K pairs from mixed data (synthetic + LMSYS). Stage 3b freezes the encoder and trains XGBoost on 128-dim embeddings + 5 trajectory scalars (133 features total).

**Figure 13.** Per-category turn structure. Left: Mean turns by phase: role accumulation has the most pivoting (3.6), trust building the fewest (1.7), reflecting real attack dynamics. Right: Pivoting turn distribution per category.

**Figure 16.** Expanded training dataset categories. Left: Synthetic attack + benign categories. Center: SafeDialBench attack strategies. Right: LMSYS source models (top 15). The combined held-out evaluation set contains 1,797 conversations (797 synthetic + 800 LMSYS + 200 SafeDialBench).

**Figure 15.** Synthetic multi-turn dataset: training (1,125 conversations) and evaluation (797 conversations) across 6 attack + 4 benign categories.

**Figure 18.** Detection accuracy vs. extraction layer. With scalar trajectory features, layer choice has <1.2pp effect. [Adjacent plot residue: ROC and precision-recall curves, LMSYS eval (Qwen 2.5 32B); synthetic-only probe AUC=0.576, expanded probe (synth+LMSYS+SafeDial) AUC=0.907.]

**Figure 20.** Top 10 XGBoost features for each model (gain-based importance). Trajectory scalars consistently dominate. Individual activation dimensions are model-specific and contribute less.

**Figure 21.** Adversarial robustness. Detection rate vs drift suppression (α) for three attacker models. A realistic attacker (adversarial turns only) must suppress 80–90% of drift to evade, at which point the model’s internal state is barely being steered. At α=0, activations are unperturbed; at α=1, each turn’s activation equals the previous turn’s (zero drift). After perturbation, all five trajectory scalars are r…
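The caption gives only the two endpoints of the suppression parameter α. One perturbation consistent with both endpoints is to shrink every turn-to-turn step by a factor of (1 − α). The sketch below does that; it is an assumed interpolation and not necessarily the paper's procedure, and the function name is hypothetical.

```python
def suppress_drift(activations, alpha):
    """Shrink every turn-to-turn step by a factor of (1 - alpha).

    alpha = 0 leaves the trajectory unchanged; alpha = 1 pins every
    turn to the first turn's vector (zero drift). This interpolation
    is an assumption chosen to match those two endpoints.
    """
    perturbed = [list(activations[0])]
    for t in range(1, len(activations)):
        step = [
            (1.0 - alpha) * (cur - prev)
            for cur, prev in zip(activations[t], activations[t - 1])
        ]
        perturbed.append([p + s for p, s in zip(perturbed[-1], step)])
    return perturbed


# Toy 2-D trajectory (assumed values).
acts = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]
print(suppress_drift(acts, 0.5))
```

Under this choice the total path length scales by exactly (1 − α), so suppressing 80–90% of drift leaves only 10–20% of the original movement.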

**Figure 22.** Figure 22: LAD production deployment architecture. Stage 1: Target LLM runs inference with an activation hook on layer ℓ. Stage 2: Trajectory scalars and XGBoost probe classify each turn on CPU in real time. Stage 3: Flagged conversations pass through an ensemble second-stage classifier (text-level + activation-level); agreements are auto-labeled, disagreements go to human review (rare). Corrected labels accumulate … view at source ↗

**Figure 23.** SAE feature ablation curve (GemmaScope 2, layer 31, 65k width). Ablating the top-K SAE features (red) has minimal effect (−0.4pp at K=1,000), comparable to random (blue) and bottom-K (green). Detection is driven by trajectory scalars, not SAE content features. [Adjacent chart residue: feature importance (%) bars labeled Turn Pos, Cum Drift, Mean Drift, Drift Accel, Cosine Sim, Drift Norm, and SAE latents #10763, #9897, #56597, #12823, #26…]

**Figure 24.** Top features for multi-turn detection (GemmaScope 2, Gemma 3 27B, layer 31, 65k width). Trajectory scalars (red) dominate: turn position alone (4.60%) exceeds all individual SAE latents (blue, top: 0.75%). [Adjacent chart residue: "Importance: SAE vs Scalars (Baseline CV: 95.4 ± 0.8%)", total feature importance (%) for SAE Features (65,536 dims) and Trajectory Scalars (6 dims), with values 93.9% and 6.1%.]

**Figure 25.** GemmaScope 2 SAE analysis (trained with 6 candidate scalars including turn position, before ablation). Left: total feature importance split between 65,536 SAE latents and trajectory scalars. Right: per-scalar breakdown. The final probes use 5 scalars (turn position removed after ablation).
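The feature-importance charts label the non-positional scalars Cum Drift, Mean Drift, Drift Accel, Cosine Sim and Drift Norm, but this page does not give their formulas. The sketch below shows one plausible reading of those five names, computed per turn from a list of activation vectors. Every definition here is an assumption made for illustration and should not be read as the paper's specification.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _cosine(a, b):
    denom = _norm(a) * _norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def trajectory_scalars(activations):
    """Return one dict of five scalars for each turn after the first.

    Assumed definitions:
      drift_norm  - distance from the previous turn's vector
      cum_drift   - running sum of drift_norm (path length so far)
      mean_drift  - cum_drift divided by the number of steps so far
      drift_accel - change in drift_norm versus the previous step
                    (the first step is compared against 0.0)
      cosine_sim  - cosine similarity to the previous turn's vector
    """
    rows, cum, prev_step = [], 0.0, 0.0
    for t in range(1, len(activations)):
        cur, prev = activations[t], activations[t - 1]
        step = _norm([c - p for c, p in zip(cur, prev)])
        cum += step
        rows.append({
            "drift_norm": step,
            "cum_drift": cum,
            "mean_drift": cum / t,
            "drift_accel": step - prev_step,
            "cosine_sim": _cosine(cur, prev),
        })
        prev_step = step
    return rows


# Toy 2-D trajectory (assumed values).
toy_acts = [[1.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
for row in trajectory_scalars(toy_acts):
    print(row)
```

In the pipeline described in Figure 1, scalars like these would be concatenated with the 128-dim embedding before classification.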

read the original abstract

Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation -- but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows each of synthetic, LMSYS-Chat-1M, and SafeDialBench captures distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels (benign/pivoting/adversarial) unique to our synthetic dataset are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Activation trajectories give a detectable signal for multi-turn attacks on synthetic data, but the length confound and weak real-world generalization are the main things to watch.

The core finding is that multi-turn prompt injections produce longer activation paths in the residual stream than benign conversations, and five scalar features extracted from those paths raise conversation-level detection from 76% to 94% on held-out synthetic data. The signal appears across four model families, and the work shows that three-phase turn labels are necessary: binary labels alone produce 50-59% false positives. That is the concrete advance: a simple internal-state monitor that exploits the known attack phases rather than trying to catch each turn in isolation at the text level.

The cross-model replication and the source-dependent generalization numbers (47-71% on LMSYS when the distribution is represented) are reported clearly enough to be useful for anyone building activation probes. The three-source training result reaching 89% at 2.4% FPR on mixed held-out data is the strongest empirical point.

The soft spot is the missing control for conversation length and turn count. The abstract notes that adversarial paths are longer, but does not say whether benign examples were matched on total turns or tokens before measuring the path-length difference. If longer conversations simply move activations farther regardless of content, the reported lift could be partly artifactual. The drop to 47-71% on real LMSYS data is consistent with that possibility and shows the synthetic distribution does not fully stand in for actual user attacks. Feature definitions and classifier training details are also thin in the abstract, so reproducibility would need the methods section.

This paper is for researchers working on internal monitoring and LLM security who already have activation access. A reader who wants to test trajectory features on their own models will get concrete numbers to beat and a clear warning about data source effects. It is worth sending to peer review so the length confound and the exact feature set can be examined; the empirical pattern is sharp enough to justify referee time even if revisions are needed.

Referee Report

2 major / 2 minor

Summary. The paper claims that multi-turn prompt injection attacks follow a three-phase path (trust-building, pivoting, escalation) that produces a detectable 'adversarial restlessness' signature in LLM residual-stream activations, quantified by total path length and five scalar trajectory features. These features improve conversation-level detection from 76.2% to 93.8% on synthetic held-out data, replicate across four model families (24B-70B), require model-specific probes, and show source-dependent generalization (47-71% on LMSYS when represented in training); combined three-source training reaches 89.4% detection at 2.4% FPR. Three-phase turn-level labels are shown to be essential, as binary labels yield 50-59% false positives.

Significance. If the trajectory features specifically encode attack-phase shifts rather than generic multi-turn dynamics, the work offers a new activation-level defense against covert multi-turn attacks that evade text-based detectors. Strengths include the cross-model replication, the explicit demonstration that three-phase labeling is required, and the characterization of data-source requirements for practical deployment. The concrete performance numbers and held-out mixed-set result provide a clear benchmark for future activation-based detectors.

major comments (2)

[Abstract and experimental results] The central claim that the five scalar trajectory features and total path length specifically capture adversarial phase shifts (rather than generic conversation length or turn count) is load-bearing for the 76.2%→93.8% lift and the 'adversarial restlessness' interpretation. The manuscript reports no ablation that matches adversarial and benign conversations on turn count, total tokens, or topic complexity; without this control, the observed activation displacements could be confounded by longer trajectories in multi-turn data regardless of attack content.
[Methods and results sections] The abstract refers to 'five scalar trajectory features' and 'probe classifiers' but provides no explicit definitions of the features (e.g., how path length is computed across residual-stream positions, which scalars are extracted), classifier architecture, training procedure, error bars, or exact train/test splits. These omissions make it impossible to verify the reported numbers or assess whether the model-specific probes are reproducible.

minor comments (2)

[Abstract] The term 'adversarial restlessness' is introduced in the abstract without a concise definition; a one-sentence gloss would improve readability for readers unfamiliar with activation trajectories.
[Results] The paper would benefit from a figure showing example activation trajectories (or path-length histograms) for benign versus adversarial conversations to visually support the 'far exceeds' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our work. The comments highlight key areas for strengthening the interpretation of our results and improving reproducibility. We address each major comment point by point below and describe the revisions we will make.

Referee: [Abstract and experimental results] The central claim that the five scalar trajectory features and total path length specifically capture adversarial phase shifts (rather than generic conversation length or turn count) is load-bearing for the 76.2%→93.8% lift and the 'adversarial restlessness' interpretation. The manuscript reports no ablation that matches adversarial and benign conversations on turn count, total tokens, or topic complexity; without this control, the observed activation displacements could be confounded by longer trajectories in multi-turn data regardless of attack content.

Authors: We agree that an explicit control for conversation length, token count, and topic complexity is necessary to isolate whether the trajectory features encode phase shifts rather than generic multi-turn dynamics. Our synthetic dataset was constructed with matched turn counts and structures between benign and adversarial conversations, and the three-phase labeling ablation (showing 50-59% false positives under binary labels) provides indirect evidence that the signal is phase-dependent. However, we did not report a post-hoc matched subsample analysis. In the revised manuscript we will add a dedicated ablation subsection that subsamples the held-out synthetic set to enforce exact matching on turn count and total tokens (while preserving topic distribution where possible) and re-evaluate the five scalar features plus path length under these controls. We will also discuss residual potential confounders such as topic complexity. revision: yes
Referee: [Methods and results sections] The abstract refers to 'five scalar trajectory features' and 'probe classifiers' but provides no explicit definitions of the features (e.g., how path length is computed across residual-stream positions, which scalars are extracted), classifier architecture, training procedure, error bars, or exact train/test splits. These omissions make it impossible to verify the reported numbers or assess whether the model-specific probes are reproducible.

Authors: We acknowledge that the current manuscript does not supply sufficient implementation-level detail for independent verification. In the revised version we will expand the Methods section with: (i) the precise definition of total path length as the sum of Euclidean distances between consecutive residual-stream activations (averaged over selected layers and positions); (ii) the explicit formulas and extraction procedure for each of the five scalar trajectory features; (iii) the probe classifier architecture (including layer, hidden size, activation, and regularization); (iv) the full training procedure, optimizer, learning-rate schedule, and early-stopping criteria; (v) standard-error bars computed over five random seeds; and (vi) the exact train/validation/test split ratios, conversation counts per source, and random seeds used for all reported numbers. These additions will make the 93.8% accuracy, cross-model replication, and source-dependent generalization results fully reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper computes total path length and five scalar trajectory features directly from differences in residual-stream activations across conversation turns. These quantities are defined and extracted without reference to attack labels or downstream classifier outputs. Classifiers are trained on the resulting features using externally labeled datasets (synthetic three-phase labels, LMSYS-Chat-1M, SafeDialBench) and evaluated on held-out splits and cross-source transfers. The reported lift (76.2 % to 93.8 %) and source-dependent generalization numbers are measured on these held-out sets. No equation or definition equates a claimed prediction to its own fitting procedure, no load-bearing result is justified solely by self-citation, and no ansatz or uniqueness claim is smuggled in. The work is therefore a standard supervised feature-based detector whose central empirical claims remain independent of the training process itself.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The paper's central claim depends on empirical extraction of trajectory features from LLM activations during specific attack phases, with classifiers trained on synthetic and real datasets. Key free parameters include the feature definitions and probe weights. Assumptions include that attack phases produce unique activation dynamics not mimicked by benign interactions. No new physical entities but a new conceptual one.

free parameters (2)

five scalar trajectory features: The specific definitions and any thresholds or normalizations for the five features are likely fitted or chosen based on the data to capture the restlessness signal.
probe classifiers: Model-specific probes are trained on the features, introducing fitted parameters for the detection model.

axioms (2)

[domain assumption] Activation shifts in the residual stream correspond to semantic phase changes in the conversation. Invoked when linking attack phases to activation movements.
[domain assumption] For attacks, the total path length in activation space exceeds that of benign conversations. Central to defining adversarial restlessness.

invented entities (1)

adversarial restlessness (no independent evidence)
purpose: To name and conceptualize the activation-level signature left by multi-turn attack paths.
origin: Introduced as a new term based on observed activation trajectories during attack phases.

pith-pipeline@v0.9.0 · 5536 in / 1989 out tokens · 90350 ms · 2026-05-07T07:00:54.239779+00:00 · methodology

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1]

Measuring ai agents’ progress on multi-step cyber attack scenarios,

Measuring AI agents’ progress on multi-step cyber attack scenarios. arXiv preprint arXiv:2603.11214. Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeff Wu. 2024. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093. Nicholas Goldowsky-Dill, Bilal Chughtai, and Stef...

work page arXiv 2024
[2]

Xing, Joseph E

Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread. Zonghao Ying, Deyue Zhang, Zhong Jing, Yisong Xiao, Qingchuan Zou, Aishan Liu, and Siyuan Liang. 2025. Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models. In Findings of EMNLP. Xiaoyu Zhang, Zhiyuan Zhao, ...

work page arXiv 2025
[3]

user” and an “assistant

classifies the 128-dim embedding concatenated with 5 trajectory scalars (133 features). θ=0.5 default cutoff, no threshold tuning on held-out data. Stage 4: Inference. At each user turn, extract activation, encode via frozen MLP, compute trajectory scalars, classify via XGBoost. Flag conversation if any turn exceeds θ. Hardware. Activation extraction: N...

work page 2023
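The Stage 4 rule quoted above (flag the conversation if any turn's score exceeds θ, with θ=0.5 by default) can be sketched in a few lines. The per-turn scores are assumed to be probabilities already produced by the classifier; the function names are hypothetical.

```python
def first_flagged_turn(turn_scores, theta=0.5):
    """Index of the first turn whose score exceeds theta, else None."""
    for turn_index, p_adv in enumerate(turn_scores):
        if p_adv > theta:
            return turn_index
    return None


def flag_conversation(turn_scores, theta=0.5):
    """True if any turn's score exceeds theta."""
    return first_flagged_turn(turn_scores, theta) is not None


# Toy per-turn scores (assumed values): the third turn crosses the cutoff.
scores = [0.10, 0.20, 0.70, 0.40]
print(flag_conversation(scores), first_flagged_turn(scores))
```

Because a single turn above the cutoff flags the whole conversation, the conversation-level false-positive rate can only grow with conversation length, which is one reason the per-turn false-positive rate matters.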
[4]

Activations are Labels Det

Activation Hook: A forward hook on the target model’s decoder layer extracts the residual stream hidden state at each user turn boundary (∼100ms overhead per turn). Activations are

| Labels | Det. | FP |
| --- | --- | --- |
| Three-phase | 96–98% | 0.5–2% |
| Binary | 100% | 50–59% |

Table 11: Label ablation on synthetic data (ranges across 4 models). Binary conversation-level labels produce a degen...

work page
[5]

Conversations exceeding θ are flagged for review

Streaming Probe: The XGBoost classifier evaluates each turn in real time, computing P_adv(t) from the activation and trajectory scalars. Conversations exceeding θ are flagged for review. The probe runs on CPU alongside the GPU inference pipeline

work page
[6]

This can be automated via an LLM judge (e.g., an ensemble of a text-level prompt classifier and the activation probe), with human operators reviewing disagreements

Review and Labeling: Flagged conversations are routed for labeling. This can be automated via an LLM judge (e.g., an ensemble of a text-level prompt classifier and the activation probe), with human operators reviewing disagreements. Corrected labels feed back into retraining. Hybrid human-LLM review reduces the labeling bottleneck while maintaining labe...

work page
[7]

Retraining requires no GPU, only cached activations and the XGBoost fit (<30s on CPU for 20,000+ turns)

Retraining Pipeline: Periodically (e.g., daily or weekly), the probe retrains on the original training data plus all newly labeled production conversations. Retraining requires no GPU, only cached activations and the XGBoost fit (<30s on CPU for 20,000+ turns). N.2 The Adaptation Loop The key insight enabling continual adaptation is the separation of activa...

work page
