Title resolution pending

Refusal in Language Models Is Mediated by a Single Direction · 2024

5 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

OGLS-SD improves on-policy self-distillation stability and math reasoning performance by constructing an outcome-discriminative steering direction from contrasts between successful and failed teacher logits.

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Future-rhyme information is linearly decodable at line boundaries across model families and strengthens with scale, yet only Gemma-3-27B causally depends on it, with the driver migrating to the boundary around layer 30 and localizing to five attention heads.

Rigorous Interpretation Is a Form of Evaluation

cs.CY · 2026-05-06 · unverdicted · novelty 5.0

Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.

Multilingual Refusal Alignment for Safer Large Language Models

cs.CL · 2026-04-24 · conditional · novelty 5.0

English-only safety alignment fails to transfer cross-lingually, while multilingual DPO training on the new RefusEU dataset improves safety across 12 European languages without degrading Global MMLU performance.

citing papers explorer

Showing 5 of 5 citing papers.

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

cs.LG · 2026-05-12 · unverdicted · none · ref 34

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

cs.CL · 2026-05-11 · unverdicted · none · ref 18

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

cs.LG · 2026-05-08 · unverdicted · none · ref 25

Rigorous Interpretation Is a Form of Evaluation

cs.CY · 2026-05-06 · unverdicted · none · ref 41

Multilingual Refusal Alignment for Safer Large Language Models

cs.CL · 2026-04-24 · conditional · none · ref 8
