Title resolution pending

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets , author= · 2024

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.

Exploring Concreteness Through a Figurative Lens

cs.CL · 2026-04-20 · unverdicted · novelty 5.0

LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.

Language-Switching Triggers Take a Latent Detour Through Language Models

cs.CL · 2026-05-18

citing papers explorer

Showing 4 of 4 citing papers.

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition cs.LG · 2026-05-14 · unverdicted · none · ref 5
QAOD projects away question-aligned directions from answer representations to isolate domain-agnostic factuality signals, enabling efficient hallucination detection with top in-domain AUROC and up to 21% better OOD transfer.
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs cs.CL · 2026-05-11 · unverdicted · none · ref 19
Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.
Exploring Concreteness Through a Figurative Lens cs.CL · 2026-04-20 · unverdicted · none · ref 38
LLMs compress concreteness into a consistent 1D direction in mid-to-late layers that separates literal from figurative noun uses and supports efficient classification plus steering.
Language-Switching Triggers Take a Latent Detour Through Language Models cs.CL · 2026-05-18 · unreviewed · ref 25

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer