Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
On the non-identifiability of steering vectors in large language models
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
years
2026 2roles
background 1polarities
background 1representative citing papers
ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.
citing papers explorer
-
Low-Resource Safety Failures Are Action Failures, Not Representation Failures
Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.
-
ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data
ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.