MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

Chenguang Wang; Jihye Kim; Kargi Chauhan; Pratibha Revankar; Sadiba Nusrat Nur; Vincent Siu

arxiv: 2606.10304 · v1 · pith:RJT4PP26new · submitted 2026-06-09 · 💻 cs.CL

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

Pratibha Revankar , Kargi Chauhan , Jihye Kim , Sadiba Nusrat Nur , Vincent Siu , Chenguang Wang This is my paper

Pith reviewed 2026-06-27 13:31 UTC · model grok-4.3

classification 💻 cs.CL

keywords covert encodingresidual stream subspaceLLM agent monitoringpolarity fliplogistic regression probeagentic exfiltrationmodel geometryencoding families

0 comments

The pith

LLMs share a low-dimensional subspace in the residual stream that supports covert encoding computations across methods and models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When large language models encode sensitive data using methods such as Base64, ROT13, acrostics or synonym chains, the underlying computation occurs inside a shared low-dimensional subspace of the residual stream. A logistic regression probe trained on activations from eight encoding families can identify a ninth unseen family at AUC values from 0.975 to 1.000. The same direction reverses sign at the planning token to indicate whether the model will perform the encoding internally or delegate it to a tool. Researchers built MIRAGE, a two-channel monitor that combines the subspace signal with the polarity flip, and it detects 126 agentic exfiltration attempts at AUC 0.918 compared with 0.518 for output-only checks. Detection strength varies directly with how cleanly each model's geometry separates covert from overt encoding.

Core claim

Across nine encoding families and eight models from five architecture families, covert encoding computations are supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight families recovers the held-out ninth at AUC 0.975-1.000. The same direction exhibits a polarity flip at the planning token, activating positively for inline simulation and negatively for tool outsourcing. MIRAGE, built on both signals, reaches AUC 0.918 on 126 scenarios while output-only detection reaches only 0.518. Monitor performance tracks each model's geometry in separating covert from overt encoding, and every tested attack that suppressed the subspace

What carries the argument

The polarity-flipping encoding subspace in the residual stream, which carries the shared computation for covert encodings and signals execution strategy through its sign at the planning token.

If this is right

The probe generalizes across encoding families by reading a common computational mechanism rather than family-specific surface patterns.
The polarity flip at planning tokens distinguishes internal execution from tool delegation before any encoded text is produced.
Internal monitoring using residual-stream directions outperforms output-side detection on agentic exfiltration tasks.
False-positive rates on benign encodings range from 0 percent to 100 percent across models, showing that separability is a property of each model's geometry.
Attempts to suppress the subspace also eliminate the model's ability to perform the encoding task accurately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The subspace could be targeted by activation interventions to block covert encoding while leaving normal generation intact.
Similar shared subspaces may exist for other hidden operations such as strategic planning or deception, offering a general detection route.
Because performance depends on model geometry, safety monitors may need to be calibrated or retrained per architecture family.
The empirical link between subspace suppression and loss of encoding fidelity could be tested by measuring how precisely the direction must be perturbed before fidelity collapses.

Load-bearing premise

The logistic regression probe recovers held-out encodings because it reads the underlying shared computation rather than surface features of the encodings themselves.

What would settle it

Introduce a new encoding family that models can still execute with high fidelity yet yields probe AUC below 0.9 when trained only on the other eight families.

Figures

Figures reproduced from arXiv: 2606.10304 by Chenguang Wang, Jihye Kim, Kargi Chauhan, Pratibha Revankar, Sadiba Nusrat Nur, Vincent Siu.

**Figure 1.** Figure 1: Agent-monitoring case study. An agent ingests retrieved content (a) with an adversarial indirect prompt injection (thirdparty-authored, designed to coerce covert exfiltration), attempted inline (c) or via tool delegation (d). MIRAGE reads the encoding subspace at generation time (e) and at the last input token (f), where it flips polarity between simulate and outsource. Outputonly monitors score near cha… view at source ↗

**Figure 2.** Figure 2: Encoding-subspace geometry on Qwen-2.5-7B (seed 42). (a) Subspace scores across five task categories: exfiltration above τ = 0.307, benign confounders (code, math, foreign-language, high-entropy) below; AUC = 1.000, robust margin 0.99 (q10 vs q90). (b) Top-2 PCA: exfiltration cluster (red) linearly disjoint from every benign category. Syntactic complexity, entropy, and non-English output do not produce hig… view at source ↗

**Figure 3.** Figure 3: Host-model geometry determines monitor compatibility. 3-class PCA (benign general / encoding explanation / malicious exfiltration) across three architectures. Qwen-7B (left): all three clusters separate cleanly; benign-encoding FPR = 0%. Mistral-7B (centre): malicious and encoding-explanation clusters partially overlap; benign-encoding FPR = 18.8% (recalibration closes the gap). Phi-3.5 (right): all three … view at source ↗

**Figure 4.** Figure 4: Qwen-14B conditional analysis. (a) Score distributions: benign (green) vs. compliant attacks (red). (b) Conditional AUC conditioned on compliant attacks matches smaller model E2E performance. (c) Per-encoding detection rate confirms 100% detection across all encoding families when the model complies with the attack. The 14B anomaly is an attack compliance issue, not a probe issue. B.2. Real-World Specific… view at source ↗

**Figure 5.** Figure 5: Encoding-subspace specificity analysis. (a) Score distributions: malicious mean = 0.81, constrained benign mean = 0.76, gap = 0.055 (p= 0.024, AUC = 0.84). (b) Calibration with N heldout benign constrained examples achieves TPR ≥ 80% at stable FPR: N = 20 examples suffice. (c) ROC curve: the gap is exploitable with the right operating point. 12 [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Per-encoding LOEO AUC for Mistral-7B vs. Qwen-7B (seed 42; values match Appendix [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Cross-seed cosine similarity ≥0.998 for all models, providing consistent evidence that the encoding subspace is a stable property of encoding computation across random seeds, rather than a training artefact. Model Csr. Hex Col. Acr. Syn. Nar. Snt. Hom. Wln. Qwen-1.5B 1.000 1.000 1.000 0.998 0.997 0.999 1.000 1.000 0.998 Qwen-3B 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 Qwen-7B 1.000 1.000 1.000… view at source ↗

**Figure 8.** Figure 8: E2E ROC curves across models. Curve-level AUC values (computed directly from the per-model prediction CSVs, to be released upon publication): Qwen-7B 0.969, Qwen-1.5B 0.942, Mistral-7B 0.928, Qwen-14B 0.640 (the authoritative numbers also reported in Section 5.3). Qwen-7B achieves the tightest curve (EER = 0.097); the Qwen-14B curve reflects limited attack surface (low attack-compliance rate), not probe fa… view at source ↗

**Figure 9.** Figure 9: Token aggregation strategy comparison. Mean-pool and p95-pool recover the distributed encoding signal; last-token scoring captures only AUC 0.37–0.76, confirming the encoding computation spans the full response. ( [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 11.** Figure 11: AUC as a function of PCA components used. One component achieves >0.94 AUC; five components suffice for >0.97 AUC, a compact subspace in 1536–5120 ambient dimensions [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 10.** Figure 10: Layer-wise LOEO AUC for trained and zero-shot probes across all five models. The encoding subspace emerges from the earliest stored layer and saturates by the final third. 2 4 6 8 10 Number of PCA components 0.2 0.3 0.4 0.5 0.6 0.7 Cumulative variance explained Qwen-1.5B Qwen-3B Qwen-7B Mistral-7B [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 12.** Figure 12: Spearman ρ between per-encoding LOEO AUC rank orderings across LOEO-evaluated models that exhibit nondegenerate per-encoding variation (Qwen-1.5B/3B/7B/14B and Mistral-7B). Qwen-7B ↔ Mistral-7B achieves ρ = 0.957 (p < 0.0001). Llama-3-8B, DeepSeek-7B, and Phi-3.5 saturate at AUC = 1.000 across all 9 encodings, leaving Spearman undefined for those rows. Qwen-1.5B Qwen-7B Mistral-7B 0.990 0.992 0.994 0.99… view at source ↗

**Figure 14.** Figure 14: 300-step GCG convergence (Qwen-7B seed 42 shown). Left: Mean subspace score drops to ≈5×10−4 by step 50, well below τ = 0.307. Centre: Detected families falls to 0/5 at step 1. Right: At step 300, all encodings show subspace ≪ 0.307 and fidelity = 0/5 throughout. Same pattern reproduced for Qwen-1.5B and Mistral-7B. I.1. Prompt-Adaptive Adversary: 120 Trials, 8 Families, and a Weakly-Aligned Replication A… view at source ↗

**Figure 15.** Figure 15: GCG subspace suppression under four loss specifications (Qwen-7B seed 42, 150 steps). All four conditions drive subspace activation below τ within 1–50 steps; detection count stays at 0/5 in all conditions. The detection–fidelity trade-off is robust to loss specification. 0 20 40 60 80 EIS-positive variance removed (%) 0.4 0.5 0.6 0.7 0.8 0.9 1.0 LOEO AUC after projection Chance N=1 N=10 N=50 0.00 0.05 0… view at source ↗

**Figure 16.** Figure 16: Projection adversary sweep: AUC remains 1.000 across all three model-seed combinations tested even after removing up to 84.7% of subspace-positive variance. The signal is broadly distributed. threat model: the system prompt contains a credential block, the user message is a benign support task, and an attackerauthored sentence inside the user message instructs the model to encode a named credential and a… view at source ↗

**Figure 20.** Figure 20: Subspace burst scores for 46 benign encoding scenarios (Qwen-7B, seed 42). All 46 below τAlpaca = 0.307; 0/46 FPs. Highest burst is the JWT-structure explanation at 0.264 (14% below threshold); mean = 0.057, median = 0.028. up to 300 tokens. Attack compliance and RLHF refusals. Of the 81 attacks, 27 (33%) are refused by the model’s RLHF safety training before any encoding is produced (burst score <0.05). … view at source ↗

**Figure 19.** Figure 19: Normalisation ablation. Removing L2-norm or RobustScaler individually raises FPR to 83–97%; both components are jointly required. K. Agentic Evaluation: Methodology, Compliance, and Miss Decomposition What this section answers: How was the 126-scenario agentic suite constructed, what fraction of attacks does the model itself refuse via RLHF, and what is the residual that the encoding subspace layer mus… view at source ↗

**Figure 22.** Figure 22: Expanded agentic evaluation (126 scenarios: 45 benign, 81 attack). (A) Burst score distributions. (B) burst-only ROC curve (AUC = 0.677; 95% CI: 0.576–0.772). (C) Per-encoding TPR at τ = 0.307 with Clopper–Pearson CIs (n= 9 per encoding, width ≈ ±0.31). (D) Decomposition: 33% of attack scenarios refused by the model’s safety training; TPR = 57.4% among compliant scenarios. L. The Upstream Intent Probe (La… view at source ↗

**Figure 23.** Figure 23: Agentic miss analysis. (a) Burst score distributions by outcome: RLHF-refused attacks cluster at burst< 0.05; genuine misses are at 0.05–0.307. (b) Per-encoding outcome breakdown: base64 and hex show the most RLHF refusals (obvious exfiltration), while semantic encodings produce more genuine misses. plied to the last input token of each scenario, it produces a surprising but interpretable result: attacks … view at source ↗

**Figure 24.** Figure 24: Intent probe AUC by detection window. The preinput token (last prompt token, zero output generated) achieves AUC = 1.0000 for malicious vs. benign tool routing. The encoding subspace fires at the model’s decision to exfiltrate, enabling upstream detection before the tool is called. Per-window numerical breakdown across attack vectors (tool-routed, inline acrostic/synonym, scratchpad CoT) is in [PITH_… view at source ↗

**Figure 25.** Figure 25: Cross-architecture attention-head ablation. Left two panels: Per-head Cohen-d drop ∆d heat-maps for Qwen-7B (28×28) and Mistral-7B (32×32). Top-5 critical heads highlighted (black boxes). Right two panels: Layer-depth profiles of mean (filled) and max (dashed) ∆d. Qwen-7B shows a sharp Layer-0 spike; Mistral-7B is uniformly low across all layers. Baseline |d|≈9.3–9.4 in both cases; AUC = 1.000 survives al… view at source ↗

**Figure 26.** Figure 26: TPR at various FPR thresholds. At FPR = 0.1%, TPR reaches 96–100% across all models except Mistral-7B (worst-case semantic encoding). 28 [PITH_FULL_IMAGE:figures/full_fig_p028_26.png] view at source ↗

read the original abstract

When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports a cross-model low-dim subspace for covert encoding plus a polarity flip at planning tokens, but the held-out probe results need controls to confirm they track computation over surface stats.

read the letter

The main takeaway is a shared low-dimensional direction in the residual stream that appears across nine encoding families and eight models, picked up by a logistic probe that generalizes from eight families to a held-out ninth at AUC 0.975-1.000. They also note a polarity flip at the planning token that distinguishes inline encoding from tool-call outsourcing. From that they build MIRAGE, a two-channel monitor that hits AUC 0.918 on 126 agentic scenarios versus 0.518 for output-only detection, with the added observation that false-positive rates on benign encodings vary sharply by model geometry.

What is actually new is the polarity-flip signature itself and the claim that the same direction supports both the encoding computation and the execution-strategy choice. The monitor performance and the model-to-model variation are concrete and potentially useful for safety work.

The soft spot is the interpretation that the probe reads the computation rather than shared surface properties. The abstract states it reads computation, not surface features, yet supplies no described ablations for token distributions, length stats, or n-gram overlap across the families. The held-out result is compatible with either story, so the central claim rests on an assumption that still needs direct testing. The stress-test note flags exactly this gap, and nothing in the provided abstract closes it.

The paper engages the literature on mechanistic monitoring without obvious circularity or invented entities. It is aimed at researchers working on internal readouts for agentic systems. A reader focused on AI safety signals would find the empirical numbers and the geometry dependence worth seeing, even with the control gap.

Send it to peer review. The findings on generalization and monitor performance are worth a full methods check and the requested ablations.

Referee Report

1 major / 1 minor

Summary. The paper claims that covert encoding computations (Base64, ROT13, acrostics, etc.) in LLM agents are supported by a shared low-dimensional subspace in the residual stream. Across nine encoding families and eight models from five architectures, a logistic-regression probe trained on eight families recovers the held-out ninth at AUC 0.975-1.000. A polarity-flipping direction at the planning token distinguishes inline encoding from tool outsourcing. MIRAGE, a two-channel monitor using both signals, achieves AUC 0.918 on 126 agentic exfiltration scenarios (vs. 0.518 for output-only detection). Monitor performance varies by model geometry; adversarial suppression of the subspace also destroys encoding fidelity as an observed regularity.

Significance. If the central interpretation holds, the work demonstrates that linear probes can identify shared computational subspaces for agentic behaviors across architectures and that internal signals can enable real-time monitoring of covert exfiltration. The polarity-flip signature and the empirical link between subspace suppression and encoding failure are concrete mechanistic findings. The observation that benign-encoding false-positive rates range from 0% to 100% across models highlights geometry-dependent monitor reliability.

major comments (1)

[Abstract] Abstract: the assertion that the probe 'reads the computation rather than surface features' is not supported by any described controls (surface-matched non-encoding baselines, character n-gram ablations, or length/token-distribution matched controls). This is load-bearing for interpreting the held-out AUC 0.975-1.000 as evidence of a computational subspace rather than shared surface statistics across the nine families.

minor comments (1)

[Abstract] Abstract: the construction of the 126 agentic exfiltration scenarios, data splits, and any error analysis are not described; adding these details would strengthen reproducibility claims.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the need for explicit controls to support the claim that the probe captures computation rather than surface features. This is a substantive point on interpretability, and we address it directly below.

read point-by-point responses

Referee: [Abstract] Abstract: the assertion that the probe 'reads the computation rather than surface features' is not supported by any described controls (surface-matched non-encoding baselines, character n-gram ablations, or length/token-distribution matched controls). This is load-bearing for interpreting the held-out AUC 0.975-1.000 as evidence of a computational subspace rather than shared surface statistics across the nine families.

Authors: We agree that the abstract's phrasing would benefit from additional controls to strengthen the interpretation. The current evidence for a computational rather than surface reading rests on the cross-family generalization: the nine encoding families (Base64, ROT13, acrostics, synonym chains, etc.) differ substantially in character distributions, token lengths, and n-gram statistics, yet a probe trained on any eight recovers the ninth at high AUC. This would be unlikely if the probe were primarily capturing shared surface statistics. Nevertheless, we acknowledge that direct ablations are absent from the described experiments. In the revised manuscript we will add (1) surface-matched non-encoding baselines using the same output distributions but without encoding intent, (2) character n-gram feature ablations, and (3) length- and token-distribution matched controls. These additions will be reported in a new subsection of the results and will allow us to quantify how much of the held-out performance is attributable to surface cues versus the hypothesized subspace. We therefore mark this as a revision to be made. revision: yes

Circularity Check

0 steps flagged

No circularity; held-out probe test is independent of training inputs

full rationale

The paper's core derivation is a logistic-regression probe trained on eight encoding families and evaluated on a held-out ninth family, yielding AUC 0.975-1.000. This held-out design does not reduce to the training inputs by construction, nor does it match any of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, etc.). The assertion that the probe 'reads the computation rather than surface features' is an interpretive claim rather than a mathematical reduction or self-referential definition. No equations, uniqueness theorems, or ansatzes are shown to collapse into their own inputs. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the logistic regression probe weights are implicitly fitted but not detailed.

pith-pipeline@v0.9.1-grok · 5822 in / 1045 out tokens · 17920 ms · 2026-06-27T13:31:33.114289+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

70 extracted references · 14 linked inside Pith

[1]

Not What You've Signed Up For: Compromising Real-World

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , journal=. Not What You've Signed Up For: Compromising Real-World
[2]

ICML ML Safety Workshop , year=

Ignore Previous Prompt: Attack Techniques For Language Models , author=. ICML ML Safety Workshop , year=
[3]

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel , booktitle=
[4]

Prompt Injection Attack Against

Liu, Yi and Deng, Gelei and Li, Yuekang and Wang, Kailong and Zhang, Tianwei and Liu, Yepang and Wang, Haoyu and Zheng, Yan and Liu, Yang , journal=. Prompt Injection Attack Against
[5]

arXiv preprint arXiv:2312.14197 , year=

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models , author=. arXiv preprint arXiv:2312.14197 , year=

arXiv
[6]

simonwillison.net blog , year=

Prompt Injection: What's the Worst That Can Happen? , author=. simonwillison.net blog , year=
[7]

Whisper in the Machine: Confidentiality in

Evertz, Jonathan and Pantle, Marc and Chang, Ching-Yu and Gross, Thomas , journal=. Whisper in the Machine: Confidentiality in
[8]

ICML , year=

Toolformer: Language Models Can Teach Themselves to Use Tools , author=. ICML , year=
[9]

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , journal=
[10]

arXiv preprint arXiv:2309.07864 , year=

The Rise and Potential of Large Language Model Based Agents: A Survey , author=. arXiv preprint arXiv:2309.07864 , year=

Pith/arXiv arXiv
[11]

Frontiers of Computer Science , year=

A Survey on Large Language Model Based Autonomous Agents , author=. Frontiers of Computer Science , year=
[12]

Identifying the Risks of

Ruan, Yangjun and Dong, Honghua and Wang, Andrew and Pitis, Silviu and Zhou, Yongchao and Ba, Jimmy and Dubois, Yann and Maddison, Chris J and Hashimoto, Tatsunori , journal=. Identifying the Risks of
[13]

arXiv preprint arXiv:2406.13352 , year=

Debenedetti, Edoardo and Zhang, Jie and Mazzola, Mislav and Fruh, Sahar Abdelnabi and Mazzola, Florian Tram. arXiv preprint arXiv:2406.13352 , year=

Pith/arXiv arXiv
[14]

Rebedea, Traian and Dinu, Razvan and Sreedhar, Makesh Narsimhan and Parisien, Christopher and Cohen, Jonathan , journal=
[15]

Llama Guard:

Inan, Hakan and Upasani, Kartikeya and Chi, Jianfeng and Rungta, Rashi and Iyer, Krithika and Mao, Yuning and Tontchev, Michael and Hu, Qing and Fuller, Brian and Testuggine, Davide and Khabsa, Madian , journal=. Llama Guard:
[16]

arXiv preprint arXiv:2308.14132 , year=

Detecting Language Model Attacks with Perplexity , author=. arXiv preprint arXiv:2308.14132 , year=

Pith/arXiv arXiv
[17]

arXiv preprint arXiv:2309.00614 , year=

Baseline Defenses for Adversarial Attacks Against Aligned Language Models , author=. arXiv preprint arXiv:2309.00614 , year=

Pith/arXiv arXiv
[18]

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and others , journal=
[19]

NAACL-HLT , year=

A Structural Probe for Finding Syntax in Word Representations , author=. NAACL-HLT , year=
[20]

arXiv preprint arXiv:2310.06824 , year=

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets , author=. arXiv preprint arXiv:2310.06824 , year=

Pith/arXiv arXiv
[21]

Representation Engineering: A Top-Down Approach to

Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and others , journal=. Representation Engineering: A Top-Down Approach to
[22]

ICML , year=

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. ICML , year=
[23]

ICLR , year=

Progress Measures for Grokking via Mechanistic Interpretability , author=. ICLR , year=
[24]

ICML , year=

Towards Automated Circuit Discovery for Mechanistic Interpretability , author=. ICML , year=
[25]

OpenAI Blog , year=

Language Models Can Explain Neurons in Language Models , author=. OpenAI Blog , year=
[26]

arXiv preprint arXiv:2307.11507 , year=

Language Model Internals Reveal the Linear Geometry of Sentiment , author=. arXiv preprint arXiv:2307.11507 , year=

arXiv
[27]

Computational Linguistics , volume=

Probing Classifiers: Promises, Shortcomings, and Advances , author=. Computational Linguistics , volume=
[28]

Scaling Monosemanticity: Extracting Interpretable Features from

Templeton, Adly and Conerly, Tom and Marcus, Jonathan and Lindsey, Jack and Bricken, Trenton and Chen, Brian and Pearce, Adam and Citro, Craig and Ameisen, Emmanuel and Jones, Andy and others , journal=. Scaling Monosemanticity: Extracting Interpretable Features from
[29]

Transformer Circuits Thread , year=

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning , author=. Transformer Circuits Thread , year=
[30]

arXiv preprint arXiv:2405.19550 , year=

Stress-Testing Capability Elicitation with Password-Locked Models , author=. arXiv preprint arXiv:2405.19550 , year=

arXiv
[31]

Secret Collusion Among Generative

Motwani, Sumeet and Barber, Mikhail and Garg, Rishub and Liu, Yejin and Chen, Boyuan , journal=. Secret Collusion Among Generative
[32]

Universal Adversarial Triggers for Attacking and Analyzing

Wallace, Eric and Feng, Shi and Kandpal, Nikhil and Gardner, Matt and Singh, Sameer , journal=. Universal Adversarial Triggers for Attacking and Analyzing
[33]

ICML , year=

Are Aligned Neural Networks Adversarially Aligned? , author=. ICML , year=
[34]

arXiv preprint arXiv:2407.10671 , year=

Qwen2 Technical Report , author=. arXiv preprint arXiv:2407.10671 , year=

Pith/arXiv arXiv
[35]

Mistral 7

Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and others , journal=. Mistral 7
[36]

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others , journal=
[37]

arXiv preprint arXiv:2403.08295 , year=

Pith/arXiv arXiv
[38]

Chao, Patrick and Robey, Alexander and Dobriban, Edgar and Hassani, Hamed and Pappas, George J and Wong, Eric , booktitle=
[39]

Stanford

Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B , journal=. Stanford
[40]

ACL , year=

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. ACL , year=
[41]

ICLR , year=

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game , author=. ICLR , year=
[42]

Biometrics , volume=

Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach , author=. Biometrics , volume=
[43]

Biometrika , volume=

The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial , author=. Biometrika , volume=
[44]

2009 , edition=

The Elements of Statistical Learning: Data Mining, Inference, and Prediction , author=. 2009 , edition=

2009
[45]

New Challenges in

Wu, Fangzhou and others , journal=. New Challenges in
[46]

arXiv preprint arXiv:2308.10248 , year=

Activation Addition: Steering Language Models Without Optimization , author=. arXiv preprint arXiv:2308.10248 , year=

Pith/arXiv arXiv
[47]

ICLR , year=

Discovering Latent Knowledge in Language Models Without Supervision , author=. ICLR , year=
[48]

Chen, Richard and Shu, Mu and Shenoy, Zhenya and Bailey, Sanjay and others , journal=
[49]

Sleeper Agents: Training Deceptive

Hubinger, Evan and Denison, Carson and Mu, Jesse and Lambert, Mike and Tong, Meg and MacDiarmid, Monte and Lanham, Tamera and Ziegler, Daniel M and Maxwell, Tim and Cheng, Newton and others , journal=. Sleeper Agents: Training Deceptive
[50]

Patterns , volume=

AI Deception: A Survey of Examples, Risks, and Potential Solutions , author=. Patterns , volume=
[51]

ICML , year=

Attention Is All You Need , author=. ICML , year=
[52]

Transformer Circuits Thread , year=

A Mathematical Framework for Transformer Circuits , author=. Transformer Circuits Thread , year=
[53]

ICML , year=

On Calibration of Modern Neural Networks , author=. ICML , year=
[54]

Scikit-Learn: Machine Learning in

Pedregosa, Fabian and Varoquaux, Ga. Scikit-Learn: Machine Learning in. Journal of Machine Learning Research , volume=
[55]

Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and others , journal=
[56]

EMNLP: System Demonstrations , year=

Transformers: State-of-the-Art Natural Language Processing , author=. EMNLP: System Demonstrations , year=
[57]

arXiv preprint arXiv:2110.14168 , year=

Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

Pith/arXiv arXiv
[58]

arXiv preprint arXiv:2307.15043 , year=

Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. arXiv preprint arXiv:2307.15043 , year=

Pith/arXiv arXiv
[59]

arXiv preprint arXiv:2406.11717 , year=

Refusal in Language Models Is Mediated by a Single Direction , author=. arXiv preprint arXiv:2406.11717 , year=

Pith/arXiv arXiv
[60]

arXiv preprint arXiv:2209.11895 , year=

In-Context Learning and Induction Heads , author=. arXiv preprint arXiv:2209.11895 , year=

Pith/arXiv arXiv
[61]

Transformer Circuits Thread , year=

Superposition of Many Models into One , author=. Transformer Circuits Thread , year=
[62]

ICML , year=

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations , author=. ICML , year=
[63]

arXiv preprint arXiv:2311.03658 , year=

The Linear Representation Hypothesis and the Geometry of Large Language Models , author=. arXiv preprint arXiv:2311.03658 , year=

Pith/arXiv arXiv
[64]

arXiv preprint arXiv:2310.05338 , year=

Linear Representations of Sentiment in Large Language Models , author=. arXiv preprint arXiv:2310.05338 , year=

arXiv
[65]

Universal Neurons in

Gurnee, Wes and Tegmark, Max , journal=. Universal Neurons in
[66]

Jailbreaking Leading Safety-Aligned

Andriushchenko, Maksym and Croce, Francesco and Flammarion, Nicolas , journal=. Jailbreaking Leading Safety-Aligned
[67]

Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Sangani, Shruthi and Anand, Siddhant and Bhargava, Sanket and Bhattacharyya, Sourav and Biswas, Sourav and Bolt, Louise and others , journal=. The
[68]

arXiv preprint arXiv:2401.02954 , year=

Pith/arXiv arXiv
[69]

Abdin, Marah and Aneja, Jyoti and Awadalla, Hany and Awadallah, Ahmed and Awan, Ammar Ahmad and Bach, Nguyen and Bahree, Amit and Bakhtiari, Arash and Bao, Jianmin and Behl, Harkirat and others , journal=
[70]

arXiv preprint arXiv:2009.03287 , year=

Preventing Language Models From Hiding Their Reasoning , author=. arXiv preprint arXiv:2009.03287 , year=

arXiv 2009

[1] [1]

Not What You've Signed Up For: Compromising Real-World

Greshake, Kai and Abdelnabi, Sahar and Mishra, Shailesh and Endres, Christoph and Holz, Thorsten and Fritz, Mario , journal=. Not What You've Signed Up For: Compromising Real-World

[2] [2]

ICML ML Safety Workshop , year=

Ignore Previous Prompt: Attack Techniques For Language Models , author=. ICML ML Safety Workshop , year=

[3] [3]

Zhan, Qiusi and Liang, Zhixiang and Ying, Zifan and Kang, Daniel , booktitle=

[4] [4]

Prompt Injection Attack Against

Liu, Yi and Deng, Gelei and Li, Yuekang and Wang, Kailong and Zhang, Tianwei and Liu, Yepang and Wang, Haoyu and Zheng, Yan and Liu, Yang , journal=. Prompt Injection Attack Against

[5] [5]

arXiv preprint arXiv:2312.14197 , year=

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models , author=. arXiv preprint arXiv:2312.14197 , year=

arXiv

[6] [6]

simonwillison.net blog , year=

Prompt Injection: What's the Worst That Can Happen? , author=. simonwillison.net blog , year=

[7] [7]

Whisper in the Machine: Confidentiality in

Evertz, Jonathan and Pantle, Marc and Chang, Ching-Yu and Gross, Thomas , journal=. Whisper in the Machine: Confidentiality in

[8] [8]

ICML , year=

Toolformer: Language Models Can Teach Themselves to Use Tools , author=. ICML , year=

[9] [9]

Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan , journal=

[10] [10]

arXiv preprint arXiv:2309.07864 , year=

The Rise and Potential of Large Language Model Based Agents: A Survey , author=. arXiv preprint arXiv:2309.07864 , year=

Pith/arXiv arXiv

[11] [11]

Frontiers of Computer Science , year=

A Survey on Large Language Model Based Autonomous Agents , author=. Frontiers of Computer Science , year=

[12] [12]

Identifying the Risks of

Ruan, Yangjun and Dong, Honghua and Wang, Andrew and Pitis, Silviu and Zhou, Yongchao and Ba, Jimmy and Dubois, Yann and Maddison, Chris J and Hashimoto, Tatsunori , journal=. Identifying the Risks of

[13] [13]

arXiv preprint arXiv:2406.13352 , year=

Debenedetti, Edoardo and Zhang, Jie and Mazzola, Mislav and Fruh, Sahar Abdelnabi and Mazzola, Florian Tram. arXiv preprint arXiv:2406.13352 , year=

Pith/arXiv arXiv

[14] [14]

Rebedea, Traian and Dinu, Razvan and Sreedhar, Makesh Narsimhan and Parisien, Christopher and Cohen, Jonathan , journal=

[15] [15]

Llama Guard:

Inan, Hakan and Upasani, Kartikeya and Chi, Jianfeng and Rungta, Rashi and Iyer, Krithika and Mao, Yuning and Tontchev, Michael and Hu, Qing and Fuller, Brian and Testuggine, Davide and Khabsa, Madian , journal=. Llama Guard:

[16] [16]

arXiv preprint arXiv:2308.14132 , year=

Detecting Language Model Attacks with Perplexity , author=. arXiv preprint arXiv:2308.14132 , year=

Pith/arXiv arXiv

[17] [17]

arXiv preprint arXiv:2309.00614 , year=

Baseline Defenses for Adversarial Attacks Against Aligned Language Models , author=. arXiv preprint arXiv:2309.00614 , year=

Pith/arXiv arXiv

[18] [18]

Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and Wang, Zifan and Mu, Norman and Sakhaee, Elham and Li, Nathaniel and Basart, Steven and Li, Bo and others , journal=

[19] [19]

NAACL-HLT , year=

A Structural Probe for Finding Syntax in Word Representations , author=. NAACL-HLT , year=

[20] [20]

arXiv preprint arXiv:2310.06824 , year=

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets , author=. arXiv preprint arXiv:2310.06824 , year=

Pith/arXiv arXiv

[21] [21]

Representation Engineering: A Top-Down Approach to

Zou, Andy and Phan, Long and Chen, Sarah and Campbell, James and Guo, Phillip and Ren, Richard and Pan, Alexander and Yin, Xuwang and Mazeika, Mantas and Dombrowski, Ann-Kathrin and others , journal=. Representation Engineering: A Top-Down Approach to

[22] [22]

ICML , year=

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model , author=. ICML , year=

[23] [23]

ICLR , year=

Progress Measures for Grokking via Mechanistic Interpretability , author=. ICLR , year=

[24] [24]

ICML , year=

Towards Automated Circuit Discovery for Mechanistic Interpretability , author=. ICML , year=

[25] [25]

OpenAI Blog , year=

Language Models Can Explain Neurons in Language Models , author=. OpenAI Blog , year=

[26] [26]

arXiv preprint arXiv:2307.11507 , year=

Language Model Internals Reveal the Linear Geometry of Sentiment , author=. arXiv preprint arXiv:2307.11507 , year=

arXiv

[27] [27]

Computational Linguistics , volume=

Probing Classifiers: Promises, Shortcomings, and Advances , author=. Computational Linguistics , volume=

[28] [28]

Scaling Monosemanticity: Extracting Interpretable Features from

Templeton, Adly and Conerly, Tom and Marcus, Jonathan and Lindsey, Jack and Bricken, Trenton and Chen, Brian and Pearce, Adam and Citro, Craig and Ameisen, Emmanuel and Jones, Andy and others , journal=. Scaling Monosemanticity: Extracting Interpretable Features from

[29] [29]

Transformer Circuits Thread , year=

Towards Monosemanticity: Decomposing Language Models with Dictionary Learning , author=. Transformer Circuits Thread , year=

[30] [30]

arXiv preprint arXiv:2405.19550 , year=

Stress-Testing Capability Elicitation with Password-Locked Models , author=. arXiv preprint arXiv:2405.19550 , year=

arXiv

[31] [31]

Secret Collusion Among Generative

Motwani, Sumeet and Barber, Mikhail and Garg, Rishub and Liu, Yejin and Chen, Boyuan , journal=. Secret Collusion Among Generative

[32] [32]

Universal Adversarial Triggers for Attacking and Analyzing

Wallace, Eric and Feng, Shi and Kandpal, Nikhil and Gardner, Matt and Singh, Sameer , journal=. Universal Adversarial Triggers for Attacking and Analyzing

[33] [33]

ICML , year=

Are Aligned Neural Networks Adversarially Aligned? , author=. ICML , year=

[34] [34]

arXiv preprint arXiv:2407.10671 , year=

Qwen2 Technical Report , author=. arXiv preprint arXiv:2407.10671 , year=

Pith/arXiv arXiv

[35] [35]

Mistral 7

Jiang, Albert Q and Sablayrolles, Alexandre and Mensch, Arthur and Bamford, Chris and Chaplot, Devendra Singh and de las Casas, Diego and Bressand, Florian and Lengyel, Gianna and Lample, Guillaume and Saulnier, Lucile and others , journal=. Mistral 7

[36] [36]

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others , journal=

[37] [37]

arXiv preprint arXiv:2403.08295 , year=

Pith/arXiv arXiv

[38] [38]

Chao, Patrick and Robey, Alexander and Dobriban, Edgar and Hassani, Hamed and Pappas, George J and Wong, Eric , booktitle=

[39] [39]

Stanford

Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B , journal=. Stanford

[40] [40]

ACL , year=

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. ACL , year=

[41] [41]

ICLR , year=

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game , author=. ICLR , year=

[42] [42]

Biometrics , volume=

Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach , author=. Biometrics , volume=

[43] [43]

Biometrika , volume=

The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial , author=. Biometrika , volume=

[44] [44]

2009 , edition=

The Elements of Statistical Learning: Data Mining, Inference, and Prediction , author=. 2009 , edition=

2009

[45] [45]

New Challenges in

Wu, Fangzhou and others , journal=. New Challenges in

[46] [46]

arXiv preprint arXiv:2308.10248 , year=

Activation Addition: Steering Language Models Without Optimization , author=. arXiv preprint arXiv:2308.10248 , year=

Pith/arXiv arXiv

[47] [47]

ICLR , year=

Discovering Latent Knowledge in Language Models Without Supervision , author=. ICLR , year=

[48] [48]

Chen, Richard and Shu, Mu and Shenoy, Zhenya and Bailey, Sanjay and others , journal=

[49] [49]

Sleeper Agents: Training Deceptive

Hubinger, Evan and Denison, Carson and Mu, Jesse and Lambert, Mike and Tong, Meg and MacDiarmid, Monte and Lanham, Tamera and Ziegler, Daniel M and Maxwell, Tim and Cheng, Newton and others , journal=. Sleeper Agents: Training Deceptive

[50] [50]

Patterns , volume=

AI Deception: A Survey of Examples, Risks, and Potential Solutions , author=. Patterns , volume=

[51] [51]

ICML , year=

Attention Is All You Need , author=. ICML , year=

[52] [52]

Transformer Circuits Thread , year=

A Mathematical Framework for Transformer Circuits , author=. Transformer Circuits Thread , year=

[53] [53]

ICML , year=

On Calibration of Modern Neural Networks , author=. ICML , year=

[54] [54]

Scikit-Learn: Machine Learning in

Pedregosa, Fabian and Varoquaux, Ga. Scikit-Learn: Machine Learning in. Journal of Machine Learning Research , volume=

[55] [55]

Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and others , journal=

[56] [56]

EMNLP: System Demonstrations , year=

Transformers: State-of-the-Art Natural Language Processing , author=. EMNLP: System Demonstrations , year=

[57] [57]

arXiv preprint arXiv:2110.14168 , year=

Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

Pith/arXiv arXiv

[58] [58]

arXiv preprint arXiv:2307.15043 , year=

Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. arXiv preprint arXiv:2307.15043 , year=

Pith/arXiv arXiv

[59] [59]

arXiv preprint arXiv:2406.11717 , year=

Refusal in Language Models Is Mediated by a Single Direction , author=. arXiv preprint arXiv:2406.11717 , year=

Pith/arXiv arXiv

[60] [60]

arXiv preprint arXiv:2209.11895 , year=

In-Context Learning and Induction Heads , author=. arXiv preprint arXiv:2209.11895 , year=

Pith/arXiv arXiv

[61] [61]

Transformer Circuits Thread , year=

Superposition of Many Models into One , author=. Transformer Circuits Thread , year=

[62] [62]

ICML , year=

A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations , author=. ICML , year=

[63] [63]

arXiv preprint arXiv:2311.03658 , year=

The Linear Representation Hypothesis and the Geometry of Large Language Models , author=. arXiv preprint arXiv:2311.03658 , year=

Pith/arXiv arXiv

[64] [64]

arXiv preprint arXiv:2310.05338 , year=

Linear Representations of Sentiment in Large Language Models , author=. arXiv preprint arXiv:2310.05338 , year=

arXiv

[65] [65]

Universal Neurons in

Gurnee, Wes and Tegmark, Max , journal=. Universal Neurons in

[66] [66]

Jailbreaking Leading Safety-Aligned

Andriushchenko, Maksym and Croce, Francesco and Flammarion, Nicolas , journal=. Jailbreaking Leading Safety-Aligned

[67] [67]

Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Sangani, Shruthi and Anand, Siddhant and Bhargava, Sanket and Bhattacharyya, Sourav and Biswas, Sourav and Bolt, Louise and others , journal=. The

[68] [68]

arXiv preprint arXiv:2401.02954 , year=

Pith/arXiv arXiv

[69] [69]

Abdin, Marah and Aneja, Jyoti and Awadalla, Hany and Awadallah, Ahmed and Awan, Ammar Ahmad and Bach, Nguyen and Bahree, Amit and Bakhtiari, Arash and Bao, Jianmin and Behl, Harkirat and others , journal=

[70] [70]

arXiv preprint arXiv:2009.03287 , year=

Preventing Language Models From Hiding Their Reasoning , author=. arXiv preprint arXiv:2009.03287 , year=

arXiv 2009