Linearly decoding refused knowledge in aligned language models. arXiv preprint arXiv:2507.00239, 2025

Aryan Shrivastava, Ari Holtzman · 2025 · arXiv 2507.00239

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

cs.CL · 2026-05-19 · conditional · novelty 8.0

HalluWorld is a controlled benchmark using explicit reference world models to automatically label and disentangle hallucinations in LLMs across synthetic environments with varying complexity and observability.

citing papers explorer

Showing 1 of 1 citing paper.

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

cs.CL · 2026-05-19 · conditional · none · ref 57
