arXiv preprint arXiv:2510.11905

LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance · 2025 · arXiv 2510.11905

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

PRISM: Recovering Instruction Sets from Language Model Activations

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.

Latent Performance Profiling of Large Language Models

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

Introduces Latent Performance Profiling (LPP) as a task-agnostic framework deriving scalar metrics from LLM latent representations and dynamics to complement benchmark evaluations.

ToxiREX: A Dataset on Toxic REasoning in ConteXt

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

Improving Cross-Format Robustness in Language Models with Multi-Format Training

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Multi-format training on a subset of data boosts cross-format robustness and performance in LLMs like GLM4 and Llama-3.1, with 30% coverage recovering most benefits.

citing papers explorer

Showing 4 of 4 citing papers after filters.

PRISM: Recovering Instruction Sets from Language Model Activations · cs.AI · 2026-06-08 · unverdicted · none · ref 60
Latent Performance Profiling of Large Language Models · cs.CL · 2026-05-28 · unverdicted · none · ref 20
ToxiREX: A Dataset on Toxic REasoning in ConteXt · cs.CL · 2026-06-26 · unverdicted · none · ref 127
Improving Cross-Format Robustness in Language Models with Multi-Format Training · cs.CL · 2026-06-10 · unverdicted · none · ref 4
