OGLS-SD improves on-policy self-distillation stability and math reasoning performance by constructing an outcome-discriminative steering direction from contrasts between successful and failed teacher logits.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5roles
background 1polarities
unclear 1representative citing papers
Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.
Future-rhyme information is linearly decodable at line boundaries across model families and strengthens with scale, yet only Gemma-3-27B causally depends on it, with the driver migrating to the boundary around layer 30 and localizing to five attention heads.
Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.
English-only safety alignment fails to transfer cross-lingually, while multilingual DPO training on the new RefusEU dataset improves safety across 12 European languages without degrading Global MMLU performance.
citing papers explorer
-
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
OGLS-SD improves on-policy self-distillation stability and math reasoning performance by constructing an outcome-discriminative steering direction from contrasts between successful and failed teacher logits.
-
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.
-
Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions
Future-rhyme information is linearly decodable at line boundaries across model families and strengthens with scale, yet only Gemma-3-27B causally depends on it, with the driver migrating to the boundary around layer 30 and localizing to five attention heads.
-
Rigorous Interpretation Is a Form of Evaluation
Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.
-
Multilingual Refusal Alignment for Safer Large Language Models
English-only safety alignment fails to transfer cross-lingually, while multilingual DPO training on the new RefusEU dataset improves safety across 12 European languages without degrading Global MMLU performance.