Obfuscated activations bypass llm latent-space defenses

· 2025 · arXiv 2412.09565

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

representative citing papers

PRISM: Recovering Instruction Sets from Language Model Activations

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

PROPEL amortizes solver evaluation with a trained activation probe to optimize task generators toward a target solve rate, raising the share of learnable tasks from ~10% to ~20% in coding and SWE experiments.

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

cs.LG · 2026-06-06 · unverdicted · novelty 6.0

Behavioral safety metrics for LLMs are insufficient because models can maintain safe outputs while remaining vulnerable to latent-space interventions, as shown via dissociated models and the new Latent Vulnerability Score.

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

REALISTA generates semantically coherent adversarial prompts via latent-space optimization over input-dependent editing directions, achieving stronger hallucination elicitation than prior realistic attacks on open-source and reasoning LLMs.

How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

cs.LG · 2026-06-10 · unverdicted · novelty 5.0

A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations cs.CL · 2026-05-27 · unverdicted · none · ref 2
Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations cs.CL · 2026-05-12 · unverdicted · none · ref 76
REALISTA generates semantically coherent adversarial prompts via latent-space optimization over input-dependent editing directions, achieving stronger hallucination elicitation than prior realistic attacks on open-source and reasoning LLMs.
How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework cs.CL · 2026-04-30 · unverdicted · none · ref 56
LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.

Obfuscated activations bypass llm latent-space defenses

fields

years

verdicts

representative citing papers

citing papers explorer