PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
Obfuscated activations bypass LLM latent-space defenses
8 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 8
representative citing papers
PROPEL amortizes solver evaluation with a trained activation probe to optimize task generators toward a target solve rate, raising the share of learnable tasks from ~10% to ~20% in coding and SWE experiments.
Behavioral safety metrics for LLMs are insufficient because models can maintain safe outputs while remaining vulnerable to latent-space interventions, as shown via dissociated models and the new Latent Vulnerability Score.
Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.
REALISTA generates semantically coherent adversarial prompts via latent-space optimization over input-dependent editing directions, achieving stronger hallucination elicitation than prior realistic attacks on open-source and reasoning LLMs.
LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.
citing papers explorer
- Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
Deception probes in LLMs collapse under stylistic shifts but recover with style-augmented training, rejecting single-direction and entropy hypotheses in favor of distributed multi-dimensional signals.
- REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA generates semantically coherent adversarial prompts via latent-space optimization over input-dependent editing directions, achieving stronger hallucination elicitation than prior realistic attacks on open-source and reasoning LLMs.
- How Language Models Process Out-of-Distribution Inputs: A Two-Pathway Framework
LLM OOD detectors are length-confounded; a two-pathway embedding-plus-trajectory framework detects covert OOD inputs at 0.721 average AUROC and 0.850 on jailbreaks.