PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
arXiv preprint arXiv:2510.11905 , year=
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
Introduces Latent Performance Profiling (LPP) as a task-agnostic framework deriving scalar metrics from LLM latent representations and dynamics to complement benchmark evaluations.
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
Multi-format training on a subset of data boosts cross-format robustness and performance in LLMs like GLM4 and Llama-3.1, with 30% coverage recovering most benefits.
citing papers explorer
-
PRISM: Recovering Instruction Sets from Language Model Activations
PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.
-
Latent Performance Profiling of Large Language Models
Introduces Latent Performance Profiling (LPP) as a task-agnostic framework deriving scalar metrics from LLM latent representations and dynamics to complement benchmark evaluations.
-
ToxiREX: A Dataset on Toxic REasoning in ConteXt
ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.
-
Improving Cross-Format Robustness in Language Models with Multi-Format Training
Multi-format training on a subset of data boosts cross-format robustness and performance in LLMs like GLM4 and Llama-3.1, with 30% coverage recovering most benefits.