LLMs encode moral foundations in human-aligned, layered representations that arise from pretraining and can be steered via dense vectors or sparse SAE features.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Tracing Moral Foundations in Large Language Models