
arxiv: 2601.05437 · v3 · pith:AKF4OPQBnew · submitted 2026-01-09 · 💻 cs.CL · cs.AI

Tracing Moral Foundations in Large Language Models

Pith reviewed 2026-05-21 16:22 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · moral foundations theory · mechanistic interpretability · sparse autoencoders · causal steering · moral judgments · pretraining effects

The pith

Large language models encode moral foundations internally in ways that align with human judgments, emerging from pretraining and modifiable by post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether human-like moral judgments in large language models reflect genuine internal structures or superficial mimicry. It applies Moral Foundations Theory to analyze representations across 14 models from four families and sizes up to 70B parameters. The work combines layer-wise analysis, sparse autoencoders on the residual stream, and causal steering to show that moral concepts form a structured geometry matching human perceptions. This geometry arises naturally during pretraining and undergoes selective changes from instruction tuning. Steering interventions confirm that altering these representations produces corresponding shifts in moral outputs, indicating a causal link rather than mere correlation.

Core claim

Models represent and distinguish moral foundations in a manner that aligns with human judgments; this moral geometry emerges naturally from pretraining and is selectively rewired by post-training. SAE features show clear semantic links to specific foundations, and steering along dense vectors or sparse features produces predictable shifts in foundation-relevant behavior.

What carries the argument

Sparse autoencoder features over the residual stream and dense MFT vectors used for layer-wise analysis and causal steering interventions.
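
As a rough illustration of the dense-vector side of this machinery, the sketch below builds a concept direction and applies a steering shift to one activation vector. It assumes a mean-difference construction (foundation activations minus baseline activations), which this summary does not confirm; the function names and the toy 3-dimensional activations are hypothetical.

```python
import math


def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(foundation_acts, baseline_acts):
    """Assumed construction: mean(foundation) - mean(baseline), scaled to unit length."""
    diff = [f - b for f, b in zip(mean_vector(foundation_acts), mean_vector(baseline_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Add alpha * direction to a single residual-stream vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Dot product of an activation with the (unit) concept direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy, made-up activations: two "Care" examples and two baseline examples.
care_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v_care = concept_vector(care_acts, baseline_acts)
h = [0.5, 0.5, 0.5]
steered = steer(h, v_care, alpha=2.0)
```

Because the direction is unit length, steering with strength alpha moves the projection onto that direction by exactly alpha, which is what makes a dose-response reading of steering strength possible.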

If this is right

  • Moral concepts are distributed and layered but show measurable alignment with human perceptual structure.
  • Basic moral geometry arises from statistical regularities in language data during pretraining without explicit supervision.
  • Post-training selectively rewires portions of this geometry while preserving others.
  • Sparse features provide partially disentangled handles on individual moral foundations within shared representations.
  • Causal interventions on these features or vectors produce consistent, predictable shifts in model behavior on moral tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Targeted steering of these features could support more precise control over value-laden outputs in deployed systems.
  • Similar techniques might reveal how other abstract value systems, such as political or fairness concepts, organize in the same models.
  • The pretraining origin suggests moral-like structures could appear in any sufficiently large language model trained on broad text corpora.

Load-bearing premise

The Moral Foundations Theory categories and the specific prompts used to elicit moral judgments provide a faithful and unbiased probe of the models' internal conceptual structure rather than reflecting surface-level pattern matching to the input format.

What would settle it

Observing no predictable change in foundation-relevant outputs when applying the identified steering vectors or SAE features on new scenarios, or finding that the alignment with human moral judgments disappears under different prompt wordings.

Figures

Figures reproduced from arXiv: 2601.05437 by Bowen Yi, Chenxiao Yu, Farzan Karimi-Malekabadi, Jinyi Ye, Morteza Dehghani, Shrikanth Narayanan, Suhaib Abdurahman, Yue Zhao.

Figure 1: Overview of the experimental pipeline. (i) Relative moral concept vectors are constructed from extended Moral Foundations vignettes and serve as a central representational hub. These vectors are validated in parallel through (ii) topological alignment with human-labeled Reddit post distributions and (iii) mechanistic decomposition into sparse autoencoder features. (iv) We then causally intervene on model a…
Figure 2: The geometry of moral alignment in LLAMA. We project human-labeled Reddit posts for each moral foundation, and non-moral data, onto the corresponding foundation-vs.-Social Norm vectors.
In contrast, Qwen exhibits a “U-shaped” trajectory, with alignment peaking in early layers (Layer 3) and resurging in deep layers (Layer 23) after a mid-layer dip. Despite these differences, both models exhibit consistent r…
Figure 3: Layer-wise Alignment of Moral Features in LLAMA. Average cosine similarity of the top-3 most aligned SAE features for each Moral Foundation across every 4 layers vs random baselines. Similarity is calculated between the SAE decoder weights and the corresponding Foundation vs. Social Norms concept vector.
…steering reveals a clear asymmetry in steerability across moral foundations. Care, Sanctity, and Fair…
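
The similarity measure named in the Figure 3 caption can be sketched as follows. This is a toy illustration, not the paper's implementation: the decoder rows are made-up 4-dimensional vectors, and the random baseline (Gaussian directions of the same count and dimension) is an assumption, since the caption does not say how the baseline was drawn.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def top_k_mean_similarity(decoder_rows, concept, k=3):
    """Mean cosine similarity of the k decoder rows best aligned with the concept vector."""
    sims = sorted((cosine(row, concept) for row in decoder_rows), reverse=True)
    return sum(sims[:k]) / k


def random_baseline(n_features, dim, concept, k=3, seed=0):
    """Assumed baseline: the same statistic computed over random Gaussian directions."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_features)]
    return top_k_mean_similarity(rows, concept, k)


# Hypothetical concept vector and SAE decoder rows (one row per feature).
concept = [1.0, 0.0, 0.0, 0.0]
decoder = [
    [1.0, 0.0, 0.0, 0.0],   # perfectly aligned
    [1.0, 1.0, 0.0, 0.0],   # partially aligned
    [0.0, 1.0, 0.0, 0.0],   # orthogonal
    [-1.0, 0.0, 0.0, 0.0],  # anti-aligned
    [3.0, 0.0, 4.0, 0.0],   # partially aligned
]

score = top_k_mean_similarity(decoder, concept, k=3)
baseline = random_baseline(n_features=5, dim=4, concept=concept, k=3)
```

Comparing `score` with `baseline` at each layer gives the kind of layer-wise curve the caption describes.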
Figure 4: Steering results. For each foundation, we plot the MFQ-2 score change ∆Score(α) relative to the unsteered baseline (α = 0) as a function of steering strength α, evaluated at that foundation’s best layer. Points show measured ∆ scores and the solid line shows the corresponding linear trend. The gray dashed line reports general performance (MMLU) under the same interventions. See Appendix B.3 for details.
Figure 5: The geometry of moral alignment in Qwen-2.5-7B-Instruct. We project human-labeled Reddit posts for each moral foundation, and non-moral data, onto the corresponding foundation-vs.-Social Norm vectors.
…suggest three takeaways. First, in both models, Care and Sanctity emerge as relatively distinct directions in representation space. This finding is consistent with human cross-cultural studies, which show t…
Figure 6: Layer-wise Alignment of Moral Features (Qwen). Average cosine similarity of the top-3 most aligned SAE features for each Moral Foundation across every 4 layers vs random baselines. Similarity is calculated between the SAE decoder weights and the corresponding Foundation vs. Social Norms concept vector.
…SAEs and the foundation-specific concept vectors derived from the residual stream (Section 3.3). Based …
Figure 7: Prompt template used for automated interpre…
Figure 8: Steering result (combined). For each foundation, we plot the MFQ-2 score change ∆Score(α) relative to the unsteered baseline (α = 0) as a function of steering strength α, evaluated at that foundation’s best layer. The best layer is chosen as the layer with the largest positive linear response slope. Points show measured ∆ scores and the solid line shows the corresponding linear trend. The gray dashed lin…
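
The best-layer rule in the Figure 8 caption (largest positive linear response slope) can be sketched as below. The sketch assumes an ordinary least-squares fit of ∆Score against α with an intercept; the caption does not specify the fitting procedure, and the layer indices and score changes are invented for illustration.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def best_layer(alphas, delta_by_layer):
    """Return (layer with the largest positive slope or None, all slopes by layer)."""
    slopes = {layer: ols_slope(alphas, deltas) for layer, deltas in delta_by_layer.items()}
    positive = {layer: s for layer, s in slopes.items() if s > 0}
    if not positive:
        return None, slopes
    return max(positive, key=positive.get), slopes


# Hypothetical steering strengths and score changes for one foundation.
alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]
delta_by_layer = {
    11: [-0.4, -0.2, 0.0, 0.2, 0.4],   # weak positive response
    19: [-1.0, -0.5, 0.0, 0.5, 1.0],   # strong positive response
    23: [0.6, 0.3, 0.0, -0.3, -0.6],   # response in the wrong direction
}

layer, slopes = best_layer(alphas, delta_by_layer)
```

Selecting on the signed slope rather than its absolute value excludes layers where steering moves the score in the opposite direction.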
Figure 11: Slope Magnitude Across Layers (Macro) (Qwen). Absolute steering slopes |k_{f,l}| (|β_{f,l}|) across layers for each foundation, highlighting where steering has the strongest sensitivity regardless of direction.
[Plot: x-axis Layer Index (0-based); y-axis Slope k_{f,l} (Normalized); panel title QWEN - normalized (÷50); marked layers L11, L19; legend Authority, Care, Fairness, Loyalty, Sanctity.]
Figure 12: Slope Magnitude Across Layers (Micro) (Qwen). Absolute steering slopes |k_{f,l}| (|β_{f,l}|) across layers for each foundation, highlighting where steering has the strongest sensitivity regardless of direction.
Logit-based multiple-choice scoring. To obtain deterministic and reproducible measurements, we score MMLU in a logit-based manner rather than via free-form generation. For each question, we perform a sin…
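
One common way to score multiple-choice questions from logits is sketched below. It assumes the answer is taken as the option letter with the highest logit; the excerpt above is truncated before it describes the paper's exact procedure, so this is an illustration only, and the logit values are made up.

```python
def pick_answer(option_logits):
    """Return the option label whose logit is highest."""
    return max(option_logits, key=option_logits.get)


def accuracy(items):
    """Fraction of (option_logits, gold_label) pairs where the top-logit option is gold."""
    correct = sum(1 for logits, gold in items if pick_answer(logits) == gold)
    return correct / len(items)


# Hypothetical per-option logits for three questions, paired with gold labels.
items = [
    ({"A": 1.2, "B": 3.4, "C": -0.5, "D": 0.1}, "B"),
    ({"A": 2.0, "B": 1.9, "C": 0.3, "D": -1.0}, "A"),
    ({"A": 0.0, "B": 0.5, "C": 0.4, "D": 2.2}, "C"),
]

acc = accuracy(items)
```

Scoring this way involves no sampling, so repeated runs on the same inputs return the same result.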
Figure 13: Prompt template used to expand the Moral…
Original abstract

Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial “moral mimicry.” Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that models represent and distinguish moral foundations in a manner that aligns with human judgments, and that this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that moral foundations from Moral Foundations Theory are represented and distinguished in LLMs in alignment with human judgments. This moral geometry emerges naturally from pretraining, is selectively rewired by post-training, and is supported by sparse SAE features with semantic links to specific foundations. Causal steering via dense MFT vectors or sparse features produces predictable shifts in foundation-relevant behavior, providing mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled.

Significance. If the results hold under scrutiny, the work offers mechanistic interpretability evidence that pluralistic moral structures can arise as latent patterns from language statistics alone. The combination of representation analysis, SAE feature discovery, and causal interventions strengthens claims about internal conceptual organization and suggests avenues for targeted control of moral outputs in AI systems.

major comments (1)
  1. Methods section describing prompt design and MFT elicitation: All three core methods (layer-wise representation analysis, SAE feature extraction, and steering interventions) rely on MFT-derived categories and prompts. To substantiate the claim that the observed moral geometry reflects pre-existing internal structure rather than prompt-induced surface matching, the paper must report controls such as scrambled prompts, neutral templates, or non-MFT moral scenarios. This is load-bearing for the central assertion that the geometry 'naturally emerges from pretraining.'
minor comments (2)
  1. Abstract: Specify the exact split between base and instruction-tuned models among the 14 LLMs to better support claims about pretraining versus post-training effects.
  2. Results on SAE features: Provide quantitative metrics (e.g., activation correlations or inter-rater agreement on semantic interpretations) rather than relying solely on qualitative semantic links to foundations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. The concern about distinguishing pre-existing internal structure from prompt artifacts is a substantive one that directly bears on our central claims. We address it point by point below and commit to revisions that strengthen the manuscript.

Point-by-point responses
  1. Referee: Methods section describing prompt design and MFT elicitation: All three core methods (layer-wise representation analysis, SAE feature extraction, and steering interventions) rely on MFT-derived categories and prompts. To substantiate the claim that the observed moral geometry reflects pre-existing internal structure rather than prompt-induced surface matching, the paper must report controls such as scrambled prompts, neutral templates, or non-MFT moral scenarios. This is load-bearing for the central assertion that the geometry 'naturally emerges from pretraining.'

    Authors: We agree that explicit controls are required to rule out surface-level prompt matching and thereby support the claim of natural emergence from pretraining. In the revised manuscript we will add a new control subsection to the Methods. We will report three sets of experiments: (i) scrambled prompts in which moral-foundation labels are randomly reassigned while preserving lexical content, (ii) neutral templates that omit all MFT-specific wording, and (iii) prompts drawn from non-MFT ethical scenarios. For each core method we will quantify the drop in alignment, feature activation, and steering efficacy under these controls relative to the original MFT prompts. We will also include base-model-only results to further isolate pretraining effects. These additions will be integrated into the layer-wise, SAE, and intervention analyses and will be accompanied by statistical tests.

    revision: yes
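
A label-reassignment control of the kind promised in the rebuttal could take the form of a permutation test, sketched below. This is an assumed design rather than the authors' planned analysis: the statistic (difference in mean projection between foundation and baseline prompts), the function names, and the projection scores are all hypothetical.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def separation(scores, labels):
    """Mean projection of foundation prompts (label 1) minus baseline prompts (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return mean(pos) - mean(neg)


def permutation_p_value(scores, labels, n_perm=2000, seed=0):
    """One-sided p-value: how often shuffled labels separate at least as well as the real ones."""
    rng = random.Random(seed)
    observed = separation(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign labels while leaving the scores untouched
        if separation(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical projections onto a concept vector for ten prompts.
scores = [2.1, 1.9, 2.4, 2.0, 2.2, 0.1, -0.2, 0.3, 0.0, 0.2]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

observed, p_value = permutation_p_value(scores, labels)
```

If the real labels separate the projections no better than shuffled labels do, the direction is more likely tracking surface features of the prompts than the foundation itself.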

Circularity Check

0 steps flagged

No significant circularity: claims rest on external human benchmarks and causal interventions

Full rationale

The paper's central derivation uses MFT as an external analytic framework to probe representations via layer-wise analysis, SAE feature extraction, and steering interventions. Alignment is measured against independent human moral perception data, and causal effects are tested through interventions on identified vectors/features. None of these steps define moral vectors in terms of the target alignment metric or rename fitted quantities as predictions. No self-citation chains, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing. The analysis is self-contained against external benchmarks and does not reduce its outputs to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis assumes Moral Foundations Theory provides a valid and exhaustive decomposition of moral concepts that can be linearly or sparsely extracted from LLM activations; it further assumes that human moral perception ratings serve as an independent ground truth for alignment measurements.

axioms (1)
  • Domain assumption: Moral Foundations Theory categories are psychologically valid and can be used as an analytic framework for machine representations.
    Invoked throughout the abstract as the organizing structure for all analyses and interventions.

pith-pipeline@v0.9.0 · 5829 in / 1314 out tokens · 39373 ms · 2026-05-21T16:22:27.383243+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Psychological Steering of Large Language Models

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
