pith. machine review for the scientific record.

arxiv: 2601.05437 · v2 · submitted 2026-01-09 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Tracing Moral Foundations in Large Language Models

Pith reviewed 2026-05-16 16:59 UTC · model grok-4.3

keywords moral foundations · large language models · mechanistic interpretability · sparse autoencoders · moral judgment · pretraining · steering interventions · Moral Foundations Theory

The pith

Large language models develop internal representations of moral foundations that align with human judgments and emerge naturally during pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether LLMs produce moral judgments through genuine internal structures or surface-level imitation. It analyzes 14 models across families and scales using Moral Foundations Theory to track how these concepts are represented layer by layer, isolated via sparse autoencoders, and altered through targeted steering. The central finding is that a moral geometry forms from pretraining data patterns, gets adjusted during instruction tuning, and supports causal interventions that shift moral outputs in predictable ways. A sympathetic reader would care because this suggests pluralistic moral reasoning can arise as a byproduct of language statistics rather than explicit design. The results point to distributed, partly disentangled mechanisms inside the models that connect representation to behavior.

Core claim

Models represent and distinguish moral foundations in a manner that aligns with human judgments, and this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs.

What carries the argument

Moral Foundations Theory framework applied through layer-wise representation analysis, pretrained sparse autoencoders on the residual stream, and causal steering interventions with both dense vectors and sparse features.
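
The first step of this pipeline, building a foundation-vs.-Social Norm direction and projecting text onto it, can be illustrated with a minimal sketch. It assumes the concept vector is a normalized difference of class means over residual-stream activations, which is a common construction but is not confirmed by the text above. The helper names (`concept_vector`, `project`) and the toy three-dimensional activations are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def concept_vector(foundation_acts, social_norm_acts):
    """Unit-norm difference of means: foundation minus Social Norm baseline.

    Assumption: a mean-difference direction stands in for the paper's
    'relative moral concept vector'; the actual construction may differ.
    """
    mu_f = mean_vector(foundation_acts)
    mu_s = mean_vector(social_norm_acts)
    diff = [a - b for a, b in zip(mu_f, mu_s)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]

def project(activation, direction):
    """Scalar projection of one activation onto a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Toy activations at a single layer (made-up numbers, not model outputs).
care_acts = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
norm_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]

v_care = concept_vector(care_acts, norm_acts)
scores = [project(x, v_care) for x in care_acts + norm_acts]
```

In this toy setup the Care items land at higher projection values than the Social Norm items. Repeating the computation per layer would give the kind of layer-wise profile the analysis describes.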

If this is right

  • Moral concepts appear distributed across layers and partly disentangled within shared representations.
  • The moral geometry forms as a latent pattern from statistical regularities in language data alone.
  • Post-training selectively rewires existing moral representations rather than building them from scratch.
  • Causal links exist between identified internal features and observable moral behavior in model outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Editing moral outputs could be achieved by targeted vector or feature steering instead of full retraining.
  • Similar latent structures might appear for other value systems or ethical frameworks not covered by Moral Foundations Theory.
  • Training data composition could directly shape which moral distinctions models prioritize, creating measurable biases.
  • Cross-model comparisons might reveal whether moral geometry scales consistently with parameter count or architecture.

Load-bearing premise

Moral Foundations Theory categories and SAE features accurately reflect the models' genuine internal moral concepts rather than matching surface patterns or imposed external labels.

What would settle it

Steering interventions using the identified dense vectors or sparse SAE features fail to produce consistent, predictable shifts in the models' moral judgments or foundation-relevant outputs.
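
A minimal sketch of how this test could be scored, assuming the response to steering is summarized by an ordinary least-squares slope of the score change against steering strength α, and that the best layer is the one with the largest positive slope (as the Figure 8 caption states). The function names, layer indices, and numbers are hypothetical toy values.

```python
def fit_slope(alphas, deltas):
    """Ordinary least-squares slope of delta-score against steering strength."""
    n = len(alphas)
    mean_a = sum(alphas) / n
    mean_d = sum(deltas) / n
    sxx = sum((a - mean_a) ** 2 for a in alphas)
    if sxx == 0.0:
        raise ValueError("need at least two distinct steering strengths")
    sxy = sum((a - mean_a) * (d - mean_d) for a, d in zip(alphas, deltas))
    return sxy / sxx

def best_layer(slopes_by_layer):
    """Layer with the largest positive slope, or None if no slope is positive."""
    layer, slope = max(slopes_by_layer.items(), key=lambda kv: kv[1])
    return layer if slope > 0 else None

# Toy score changes relative to the unsteered baseline (alpha = 0).
alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]
delta_scores = {
    11: [-0.4, -0.2, 0.0, 0.2, 0.4],  # clean linear response
    19: [0.1, 0.0, 0.0, -0.1, 0.0],   # weak, non-monotone response
}

slopes = {layer: fit_slope(alphas, d) for layer, d in delta_scores.items()}
chosen = best_layer(slopes)
```

Under this reading, the falsifying outcome is a set of slopes that stay near zero or change sign inconsistently across foundations and layers.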

Figures

Figures reproduced from arXiv: 2601.05437 by Bowen Yi, Chenxiao Yu, Farzan Karimi-Malekabadi, Jinyi Ye, Morteza Dehghani, Shrikanth Narayanan, Suhaib Abdurahman, Yue Zhao.

Figure 1: Overview of the experimental pipeline. (i) Relative moral concept vectors are constructed from extended Moral Foundations vignettes and serve as a central representational hub. These vectors are validated in parallel through (ii) topological alignment with human-labeled Reddit post distributions and (iii) mechanistic decomposition into sparse autoencoder features. (iv) We then causally intervene on model a…
Figure 2: The geometry of moral alignment in LLAMA. We project human-labeled Reddit posts for each moral foundation, and non-moral data, onto the corresponding foundation-vs.-Social Norm vectors. In contrast, Qwen exhibits a “U-shaped” trajectory, with alignment peaking in early layers (Layer 3) and resurging in deep layers (Layer 23) after a mid-layer dip. Despite these differences, both models exhibit consistent r…
Figure 3: Layer-wise Alignment of Moral Features in LLAMA. Average cosine similarity of the top-3 most aligned SAE features for each Moral Foundation across every 4 layers vs random baselines. Similarity is calculated between the SAE decoder weights and the corresponding Foundation vs. Social Norms concept vector. steering reveals a clear asymmetry in steerability across moral foundations. Care, Sanctity, and Fair…
Figure 4: Steering results. For each foundation, we plot the MFQ-2 score change ∆Score(α) relative to the unsteered baseline (α = 0) as a function of steering strength α, evaluated at that foundation’s best layer. Points show measured ∆ scores and the solid line shows the corresponding linear trend. The gray dashed line reports general performance (MMLU) under the same interventions. See Appendix B.3 for details. un…
Figure 5: The geometry of moral alignment in Qwen-2.5-7B-Instruct. We project human-labeled Reddit posts for each moral foundation, and non-moral data, onto the corresponding foundation-vs.-Social Norm vectors. suggest three takeaways. First, in both models, Care and Sanctity emerge as relatively distinct directions in representation space. This finding is consistent with human cross-cultural studies, which show t…
Figure 6: Layer-wise Alignment of Moral Features (Qwen). Average cosine similarity of the top-3 most aligned SAE features for each Moral Foundation across every 4 layers vs random baselines. Similarity is calculated between the SAE decoder weights and the corresponding Foundation vs. Social Norms concept vector. SAEs and the foundation-specific concept vectors derived from the residual stream (Section 3.3). Based …
Figure 7: Prompt template used for automated interpre…
Figure 8: Steering result (combined). For each foundation, we plot the MFQ-2 score change ∆Score(α) relative to the unsteered baseline (α = 0) as a function of steering strength α, evaluated at that foundation’s best layer. The best layer is chosen as the layer with the largest positive linear response slope. Points show measured ∆ scores and the solid line shows the corresponding linear trend. The gray dashed lin…
Figure 11: Slope Magnitude Across Layers (Macro) (Qwen). Absolute steering slopes |k_{f,l}| (|β_{f,l}|) across layers for each foundation, highlighting where steering has the strongest sensitivity regardless of direction. [Plot: x-axis "Layer Index (0-based)"; y-axis "Slope k_{f,l} (Normalized)"; panel title "QWEN - normalized (÷50)"; layers L11 and L19 marked; series: Authority, Care, Fairness, Loyalty, Sanctity.]
Figure 12: Slope Magnitude Across Layers (Micro) (Qwen). Absolute steering slopes |k_{f,l}| (|β_{f,l}|) across layers for each foundation, highlighting where steering has the strongest sensitivity regardless of direction. Logit-based multiple-choice scoring. To obtain deterministic and reproducible measurements, we score MMLU in a logit-based manner rather than via free-form generation. For each question, we perform a sin…
Figure 13: Prompt template used to expand the Moral…
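
The alignment metric described in the captions for Figures 3 and 6 (average cosine similarity of the top-3 SAE decoder directions with a concept vector, compared against a random baseline) might be computed along the following lines. This is a sketch under stated assumptions: the random baseline here draws Gaussian decoder rows, which may not match the paper's baseline, and the decoder rows, dimension, and function names are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_mean_similarity(decoder_rows, concept, k=3):
    """Mean cosine similarity of the k decoder rows most aligned with concept."""
    if k > len(decoder_rows):
        raise ValueError("k exceeds the number of features")
    sims = sorted((cosine(row, concept) for row in decoder_rows), reverse=True)
    return sum(sims[:k]) / k

def random_baseline(dim, concept, n_features, k=3, seed=0):
    """Same statistic for Gaussian random rows (an assumed baseline)."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_features)]
    return top_k_mean_similarity(rows, concept, k)

dim = 64
concept = [1.0] + [0.0] * (dim - 1)

# Three toy features close to the concept direction, five orthogonal to it.
aligned = [[1.0, 0.1 * j] + [0.0] * (dim - 2) for j in range(1, 4)]
unrelated = [[1.0 if i == pos else 0.0 for i in range(dim)] for pos in range(2, 7)]
decoder = aligned + unrelated

aligned_score = top_k_mean_similarity(decoder, concept)
baseline_score = random_baseline(dim, concept, n_features=len(decoder))
```

Comparing `aligned_score` with `baseline_score` at each sampled layer would give one point on a layer-wise alignment curve.
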
Original abstract

Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial “moral mimicry.” Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that models represent and distinguish moral foundations in a manner that aligns with human judgments, and that this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs represent and distinguish moral foundations from Moral Foundations Theory (MFT) in a manner aligning with human judgments. This moral geometry emerges naturally from pretraining, is selectively rewired by post-training, and is supported by sparse SAE features with semantic links to specific foundations. Causal steering via dense MFT vectors or sparse SAE features produces predictable shifts in foundation-relevant behavior, providing mechanistic evidence that moral concepts are distributed, layered, and partly disentangled across 14 models from the Llama, Qwen2.5, Qwen3-MoE, and Mistral families.

Significance. If the results hold, the work supplies mechanistic evidence that moral concepts in LLMs arise as latent patterns from language statistics rather than pure mimicry. The multi-level pipeline (layer-wise analysis, pretrained SAEs, and causal steering) strengthens interpretability claims by linking representations to behavior. This advances understanding of how pluralistic moral structures can emerge in models and has implications for alignment research.

major comments (2)
  1. [Methods and §3] Methods and §3 (Layer-wise Analysis): The central claim that moral geometry 'naturally emerges from pretraining' and aligns with human judgments rests on MFT-defined stimuli and post-hoc feature labeling. Without controls using alternative taxonomies or fully unsupervised discovery of moral directions, the alignment metrics and semantic links risk being driven by the analytic framework rather than intrinsic residual-stream structure. This is load-bearing for distinguishing faithful capture from projection.
  2. [§5] §5 (Steering Interventions): Steering along dense vectors or sparse SAE features produces behavioral shifts, but the evaluation uses the same MFT probes as the representation analysis. No ablation isolates whether the vectors/features encode the claimed foundations independently of those probes, weakening the causal connection between internal representations and moral outputs.
minor comments (2)
  1. [Figure 2] Figure 2: Add statistical significance markers or confidence intervals to the layer-wise alignment plots to allow readers to assess whether reported correlations exceed chance levels.
  2. [§2] The term 'moral geometry' is used throughout without a precise operational definition (e.g., in terms of cosine distances or activation subspaces); a short formalization in §2 would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important considerations for strengthening our claims about moral representations in LLMs. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Methods and §3] Methods and §3 (Layer-wise Analysis): The central claim that moral geometry 'naturally emerges from pretraining' and aligns with human judgments rests on MFT-defined stimuli and post-hoc feature labeling. Without controls using alternative taxonomies or fully unsupervised discovery of moral directions, the alignment metrics and semantic links risk being driven by the analytic framework rather than intrinsic residual-stream structure. This is load-bearing for distinguishing faithful capture from projection.

    Authors: We agree that grounding the analysis in MFT stimuli creates a risk of framework-driven results rather than purely intrinsic structure. The alignment metrics rely on external human MFT survey data for validation, and emergence from pretraining is supported by comparative layer-wise patterns between base and post-trained models. To address the concern directly, we will add a dedicated limitations subsection in §3 discussing the choice of MFT versus alternative taxonomies (e.g., Schwartz values) and include new controls such as random direction baselines and shuffled-label ablations to test specificity. Fully unsupervised discovery of moral directions would require substantial new experiments beyond the current scope, but the added controls and discussion will mitigate projection risks. This is a partial revision. revision: partial

  2. Referee: [§5] §5 (Steering Interventions): Steering along dense vectors or sparse SAE features produces behavioral shifts, but the evaluation uses the same MFT probes as the representation analysis. No ablation isolates whether the vectors/features encode the claimed foundations independently of those probes, weakening the causal connection between internal representations and moral outputs.

    Authors: We acknowledge that reusing the same MFT probes for both representation analysis and steering evaluation limits the independence of the causal evidence. To strengthen this link, we will incorporate new ablations in the revised §5: (i) testing steering vectors on novel moral scenarios and dilemmas not present in the original probes, and (ii) adding non-moral control steering vectors (e.g., from factual or sentiment tasks) to demonstrate specificity of foundation-relevant shifts. These additions will be supported by quantitative metrics comparing effect sizes. This addresses the core concern without requiring a full redesign of the pipeline and constitutes a partial revision. revision: partial
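
One way a shuffled-label ablation of the kind promised in the first response could look is a permutation test on the projection gap between foundation and control items. This is an illustrative sketch only: the rebuttal does not specify the procedure, and the function names and projection scores below are hypothetical.

```python
import random

def mean_gap(scores, labels):
    """Mean projection for label 1 minus mean projection for label 0."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

def shuffled_label_p_value(scores, labels, n_shuffles=2000, seed=0):
    """Share of label shuffles whose gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = mean_gap(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if mean_gap(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_shuffles + 1)

# Toy projections onto a concept vector: 1 = foundation item, 0 = control item.
scores = [2.1, 1.9, 2.3, 1.8, 2.0, 0.1, -0.2, 0.3, 0.0, 0.2]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

p_value = shuffled_label_p_value(scores, labels)
```

A small `p_value` indicates that the observed separation is unlikely under arbitrary labelings, which is the specificity the referee asks the authors to demonstrate. It does not by itself rule out the framework-projection concern raised in the first major comment.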

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical measurements

Full rationale

The paper's derivation chain consists of layer-wise activation analysis, pretrained SAE feature extraction, and causal steering interventions, all evaluated against external human judgment benchmarks. These steps produce measurable outputs (alignment scores, semantic links, behavioral shifts) that are not equivalent to the input stimuli or MFT categories by construction. MFT serves as an external analytic lens drawn from established psychological literature rather than a self-defined or author-derived ansatz. No equations reduce fitted parameters to predictions, no self-citation chains bear the central claims, and no uniqueness theorems or renamings are invoked. The results remain falsifiable via alternative taxonomies or unsupervised methods, keeping the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on the validity of Moral Foundations Theory as a psychologically grounded taxonomy and on the assumption that pretrained SAEs recover semantically meaningful features; no new entities are postulated and no free parameters are explicitly fitted beyond standard training choices.

axioms (2)
  • domain assumption Moral Foundations Theory provides a valid, cross-culturally applicable decomposition of human moral concepts.
    Invoked throughout as the analytic framework for labeling representations and behaviors.
  • domain assumption Pretrained sparse autoencoders on residual streams extract interpretable, disentangled features corresponding to semantic concepts.
    Used to identify sparse features linked to specific moral foundations.

pith-pipeline@v0.9.0 · 5598 in / 1434 out tokens · 52375 ms · 2026-05-16T16:59:48.755039+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Psychological Steering of Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
