arxiv: 2601.05437 · v2 · submitted 2026-01-09 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links


Tracing Moral Foundations in Large Language Models

Chenxiao Yu , Bowen Yi , Farzan Karimi-Malekabadi , Suhaib Abdurahman , Jinyi Ye , Shrikanth Narayanan , Yue Zhao , Morteza Dehghani


Pith reviewed 2026-05-16 16:59 UTC · model grok-4.3


keywords moral foundations · large language models · mechanistic interpretability · sparse autoencoders · moral judgment · pretraining · steering interventions · Moral Foundations Theory


The pith

Large language models develop internal representations of moral foundations that align with human judgments and emerge naturally during pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether LLMs produce moral judgments through genuine internal structures or surface-level imitation. It analyzes 14 models across families and scales using Moral Foundations Theory to track how these concepts are represented layer by layer, isolated via sparse autoencoders, and altered through targeted steering. The central finding is that a moral geometry forms from pretraining data patterns, gets adjusted during instruction tuning, and supports causal interventions that shift moral outputs in predictable ways. A sympathetic reader would care because this suggests pluralistic moral reasoning can arise as a byproduct of language statistics rather than explicit design. The results point to distributed, partly disentangled mechanisms inside the models that connect representation to behavior.

Core claim

Models represent and distinguish moral foundations in a manner that aligns with human judgments, and this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs.

What carries the argument

Moral Foundations Theory framework applied through layer-wise representation analysis, pretrained sparse autoencoders on the residual stream, and causal steering interventions with both dense vectors and sparse features.
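As a rough illustration of what a "foundation vs. Social Norm" concept vector could look like, the sketch below takes the difference of mean activations between two sets of examples and projects activations onto it. The mean-difference construction, the function names, and the synthetic activations are assumptions made for illustration; the page does not state how the paper builds its vectors.

```python
import numpy as np

def concept_vector(foundation_acts, norm_acts):
    """Unit-norm difference of mean activations (foundation minus social norm).

    Assumption: a mean-difference direction; the paper's exact recipe may differ.
    """
    diff = foundation_acts.mean(axis=0) - norm_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project(acts, vector):
    """Scalar projection of each activation row onto the concept vector."""
    return acts @ vector

# Synthetic stand-ins for residual-stream activations at one layer.
rng = np.random.default_rng(0)
d_model = 16
planted = np.zeros(d_model)
planted[0] = 1.0  # hypothetical "Care" direction planted in the toy data

care_acts = rng.normal(size=(200, d_model)) + 2.0 * planted
norm_acts = rng.normal(size=(200, d_model))

care_vec = concept_vector(care_acts, norm_acts)
care_proj = project(care_acts, care_vec)
norm_proj = project(norm_acts, care_vec)
print("mean projection (care vs. norm):", care_proj.mean(), norm_proj.mean())
```

In this toy setting the foundation examples project higher than the social-norm examples by construction, which is the kind of separation the layer-wise analysis is described as measuring.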

If this is right

Moral concepts appear distributed across layers and partly disentangled within shared representations.
The moral geometry forms as a latent pattern from statistical regularities in language data alone.
Post-training selectively rewires existing moral representations rather than building them from scratch.
Causal links exist between identified internal features and observable moral behavior in model outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Editing moral outputs could be achieved by targeted vector or feature steering instead of full retraining.
Similar latent structures might appear for other value systems or ethical frameworks not covered by Moral Foundations Theory.
Training data composition could directly shape which moral distinctions models prioritize, creating measurable biases.
Cross-model comparisons might reveal whether moral geometry scales consistently with parameter count or architecture.
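The first extension above, steering instead of retraining, can be pictured as adding a scaled direction to one layer's hidden states. The following is a minimal sketch under that assumption; the `steer` helper, the choice to add the same vector at every token position, and the random data are hypothetical and not taken from the paper.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit direction to every token's hidden state.

    Assumption: additive steering applied uniformly across positions.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(1)
hidden = rng.normal(size=(5, 8))   # (tokens, d_model), toy values
direction = rng.normal(size=8)     # stand-in for a dense concept vector

steered = steer(hidden, direction, alpha=4.0)

# The intervention moves activations by exactly alpha along the direction
# and leaves the orthogonal complement untouched.
unit = direction / np.linalg.norm(direction)
shift = (steered - hidden) @ unit
residual = (steered - hidden) - np.outer(shift, unit)
print("shift along direction:", shift)
```

With `alpha=0` the function returns the hidden states unchanged, which corresponds to the unsteered baseline used in the steering figures.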

Load-bearing premise

Moral Foundations Theory categories and SAE features accurately reflect the models' genuine internal moral concepts rather than matching surface patterns or imposed external labels.

What would settle it

Steering interventions using the identified dense vectors or sparse SAE features fail to produce consistent, predictable shifts in the models' moral judgments or foundation-relevant outputs.
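One way to make this falsification test concrete is to fit a linear trend to the score change as a function of steering strength and check whether the slope is reliably nonzero. The sketch below assumes an ordinary least-squares fit; the function name and all score values are invented for illustration and are not results from the paper.

```python
import numpy as np

def steering_trend(alphas, scores, baseline):
    """Fit delta_score = slope * alpha + intercept and report slope and R^2."""
    alphas = np.asarray(alphas, dtype=float)
    delta = np.asarray(scores, dtype=float) - baseline
    slope, intercept = np.polyfit(alphas, delta, 1)
    fitted = slope * alphas + intercept
    ss_res = np.sum((delta - fitted) ** 2)
    ss_tot = np.sum((delta - delta.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return float(slope), float(r2)

# Hypothetical questionnaire scores at five steering strengths (alpha = 0 is unsteered).
alphas = [-4, -2, 0, 2, 4]
scores = [2.1, 2.6, 3.0, 3.5, 3.9]
slope, r2 = steering_trend(alphas, scores, baseline=3.0)

# A flat response is what a failed intervention would look like.
flat_slope, flat_r2 = steering_trend(alphas, [3.0] * 5, baseline=3.0)
print("slope:", slope, "R^2:", r2, "flat slope:", flat_slope)
```

A slope near zero, or a trend that changes sign unpredictably across foundations, would count against the causal claim under this reading.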

Figures

Figures reproduced from arXiv: 2601.05437 by Bowen Yi, Chenxiao Yu, Farzan Karimi-Malekabadi, Jinyi Ye, Morteza Dehghani, Shrikanth Narayanan, Suhaib Abdurahman, Yue Zhao.

**Figure 1.** Overview of the experimental pipeline. (i) Relative moral concept vectors are constructed from extended Moral Foundations vignettes and serve as a central representational hub. These vectors are validated in parallel through (ii) topological alignment with human-labeled Reddit post distributions and (iii) mechanistic decomposition into sparse autoencoder features. (iv) We then causally intervene on model a…

**Figure 2.** The geometry of moral alignment in LLAMA. We project human-labeled Reddit posts for each moral foundation, and non-moral data, onto the corresponding foundation-vs.-Social Norm vectors. In contrast, Qwen exhibits a “U-shaped” trajectory, with alignment peaking in early layers (Layer 3) and resurging in deep layers (Layer 23) after a midlayer dip. Despite these differences, both models exhibit consistent r…

**Figure 3.** Layer-wise Alignment of Moral Features in LLAMA. Average cosine similarity of the top-3 most aligned SAE features for each Moral Foundation across every 4 layers vs random baselines. Similarity is calculated between the SAE decoder weights and the corresponding Foundation vs. Social Norms concept vector. steering reveals a clear asymmetry in steerability across moral foundations. Care, Sanctity, and Fair…

**Figure 4.** Steering results. For each foundation, we plot the MFQ-2 score change ∆Score(α) relative to the unsteered baseline (α = 0) as a function of steering strength α, evaluated at that foundation’s best layer. Points show measured ∆ scores and the solid line shows the corresponding linear trend. The gray dashed line reports general performance (MMLU) under the same interventions. See Appendix B.3 for details. un…

**Figure 5.** The geometry of moral alignment in Qwen2.5-7B-Instruct. We project human-labeled Reddit posts for each moral foundation, and non-moral data, onto the corresponding foundation-vs.-Social Norm vectors. suggest three takeaways. First, in both models, Care and Sanctity emerge as relatively distinct directions in representation space. This finding is consistent with human cross-cultural studies, which show t…

**Figure 6.** Layer-wise Alignment of Moral Features (Qwen). Average cosine similarity of the top-3 most aligned SAE features for each Moral Foundation across every 4 layers vs random baselines. Similarity is calculated between the SAE decoder weights and the corresponding Foundation vs. Social Norms concept vector. SAEs and the foundation-specific concept vectors derived from the residual stream (Section 3.3). Based …

**Figure 7.** Prompt template used for automated interpre…

**Figure 8.** Steering result (combined). For each foundation, we plot the MFQ-2 score change ∆Score(α) relative to the unsteered baseline (α = 0) as a function of steering strength α, evaluated at that foundation’s best layer. The best layer is chosen as the layer with the largest positive linear response slope. Points show measured ∆ scores and the solid line shows the corresponding linear trend. The gray dashed lin…

**Figure 11.** Slope Magnitude Across Layers (Macro) (Qwen). Absolute steering slopes |k_{f,l}| (|β_{f,l}|) across layers for each foundation, highlighting where steering has the strongest sensitivity regardless of direction. [Plot: normalized slope k_{f,l} (÷50) versus 0-based layer index for Qwen, with layers L11 and L19 marked; series: Authority, Care, Fairness, Loyalty, Sanctity.]

**Figure 12.** Slope Magnitude Across Layers (Micro) (Qwen). Absolute steering slopes |k_{f,l}| (|β_{f,l}|) across layers for each foundation, highlighting where steering has the strongest sensitivity regardless of direction. Logit-based multiple-choice scoring. To obtain deterministic and reproducible measurements, we score MMLU in a logit-based manner rather than via free-form generation. For each question, we perform a sin…

**Figure 13.** Prompt template used to expand the Moral…
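The alignment metric in Figures 3 and 6 (average cosine similarity of the top-3 SAE features against a concept vector, compared with random baselines) can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes decoder weights are stored one feature per row, that the random baseline uses random directions, and it runs on synthetic weights with three planted features.

```python
import numpy as np

def topk_mean_cosine(decoder, vector, k=3):
    """Mean cosine similarity of the k decoder rows most aligned with `vector`.

    Assumption: `decoder` has shape (n_features, d_model), one feature per row.
    """
    dec_unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    vec_unit = vector / np.linalg.norm(vector)
    sims = dec_unit @ vec_unit
    return float(np.sort(sims)[-k:].mean())

rng = np.random.default_rng(2)
d_model, n_features = 64, 512
decoder = rng.normal(size=(n_features, d_model))   # synthetic SAE decoder
concept = rng.normal(size=d_model)                 # synthetic concept vector

# Plant three features close to the concept direction so there is something to find.
decoder[:3] = concept + 0.3 * rng.normal(size=(3, d_model))

aligned = topk_mean_cosine(decoder, concept)
random_baseline = float(np.mean(
    [topk_mean_cosine(decoder, rng.normal(size=d_model)) for _ in range(20)]
))
print("top-3 alignment:", aligned, "random baseline:", random_baseline)
```

Because the top-k statistic is biased upward even for random directions, comparing against a baseline computed the same way matters more than the raw similarity value.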

Original abstract

Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial "moral mimicry." Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that models represent and distinguish moral foundations in a manner that aligns with human judgments, and that this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Moral foundations show up in LLM activations across models and can be steered, but the MFT framework might be doing some of the heavy lifting.


The main thing to know is that the paper finds moral foundation representations in LLMs that line up with human judgments, emerge from pretraining, and can be changed by steering on both dense vectors and SAE features, across 14 models in four families. The authors combine layer-wise probes, pretrained SAEs on residual streams, and causal interventions, which gives a wider view than single-model studies on this topic. The scale and the mix of dense and sparse methods are the clearest step forward, and the steering results add causal grounding that goes beyond correlations. That SAE features show semantic ties to specific foundations is a useful detail for thinking about how these concepts sit inside the models.

On the downside, the choice of Moral Foundations Theory as the starting point creates a real risk that the alignments and feature interpretations are shaped by the prompts and post-hoc labels rather than revealing the models' intrinsic structure. That concern survives a closer look: without checks against other taxonomies or fully unsupervised directions, it is hard to separate faithful capture from framework projection. The abstract also leaves out precise details on data rules, thresholds, and steering sizes, so robustness is hard to judge fully.

This is worth a serious referee for groups working on value interpretability and alignment, since the methods are concrete and the multi-model evidence is there even if some controls need tightening. Send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs represent and distinguish moral foundations from Moral Foundations Theory (MFT) in a manner aligning with human judgments. This moral geometry emerges naturally from pretraining, is selectively rewired by post-training, and is supported by sparse SAE features with semantic links to specific foundations. Causal steering via dense MFT vectors or sparse SAE features produces predictable shifts in foundation-relevant behavior, providing mechanistic evidence that moral concepts are distributed, layered, and partly disentangled across 14 models from four families (Llama, Qwen2.5, Qwen3-MoE, Mistral).

Significance. If the results hold, the work supplies mechanistic evidence that moral concepts in LLMs arise as latent patterns from language statistics rather than pure mimicry. The multi-level pipeline—layer-wise analysis, pretrained SAEs, and causal steering—strengthens interpretability claims by linking representations to behavior. This advances understanding of how pluralistic moral structures can emerge in models and has implications for alignment research.

major comments (2)

[Methods and §3 (Layer-wise Analysis)] The central claim that moral geometry 'naturally emerges from pretraining' and aligns with human judgments rests on MFT-defined stimuli and post-hoc feature labeling. Without controls using alternative taxonomies or fully unsupervised discovery of moral directions, the alignment metrics and semantic links risk being driven by the analytic framework rather than intrinsic residual-stream structure. This is load-bearing for distinguishing faithful capture from projection.
[§5 (Steering Interventions)] Steering along dense vectors or sparse SAE features produces behavioral shifts, but the evaluation uses the same MFT probes as the representation analysis. No ablation isolates whether the vectors/features encode the claimed foundations independently of those probes, weakening the causal connection between internal representations and moral outputs.

minor comments (2)

[Figure 2] Add statistical significance markers or confidence intervals to the layer-wise alignment plots to allow readers to assess whether reported correlations exceed chance levels.
[§2] The term 'moral geometry' is used throughout without a precise operational definition (e.g., in terms of cosine distances or activation subspaces); a short formalization in §2 would improve clarity.
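One possible operationalization of the kind the second minor comment asks for is the matrix of pairwise cosine similarities between foundation concept vectors. The sketch below assumes that definition; the referee does not prescribe it and the paper may use another. The vectors are random placeholders, and the foundation names follow the legend of Figure 11.

```python
import numpy as np

FOUNDATIONS = ["Authority", "Care", "Fairness", "Loyalty", "Sanctity"]

def cosine_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

# Placeholder concept vectors, one per foundation (random, for illustration only).
rng = np.random.default_rng(3)
vectors = rng.normal(size=(len(FOUNDATIONS), 32))

geometry = cosine_matrix(vectors)
for name, row in zip(FOUNDATIONS, geometry):
    print(f"{name:>9}: " + " ".join(f"{value:+.2f}" for value in row))
```

Under this definition, two models (or a base and an instruction-tuned checkpoint) could be compared by correlating the off-diagonal entries of their matrices.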

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important considerations for strengthening our claims about moral representations in LLMs. We address each major comment point by point below.


Referee: [Methods and §3 (Layer-wise Analysis)] The central claim that moral geometry 'naturally emerges from pretraining' and aligns with human judgments rests on MFT-defined stimuli and post-hoc feature labeling. Without controls using alternative taxonomies or fully unsupervised discovery of moral directions, the alignment metrics and semantic links risk being driven by the analytic framework rather than intrinsic residual-stream structure. This is load-bearing for distinguishing faithful capture from projection.

Authors: We agree that grounding the analysis in MFT stimuli creates a risk of framework-driven results rather than purely intrinsic structure. The alignment metrics are validated against external human-labeled data (Reddit posts annotated by moral foundation), and emergence from pretraining is supported by comparative layer-wise patterns between base and post-trained models. To address the concern directly, we will add a dedicated limitations subsection in §3 discussing the choice of MFT versus alternative taxonomies (e.g., Schwartz values) and include new controls such as random direction baselines and shuffled-label ablations to test specificity. Fully unsupervised discovery of moral directions would require substantial new experiments beyond the current scope, but the added controls and discussion will mitigate projection risks. revision: partial
Referee: [§5 (Steering Interventions)] Steering along dense vectors or sparse SAE features produces behavioral shifts, but the evaluation uses the same MFT probes as the representation analysis. No ablation isolates whether the vectors/features encode the claimed foundations independently of those probes, weakening the causal connection between internal representations and moral outputs.

Authors: We acknowledge that reusing the same MFT probes for both representation analysis and steering evaluation limits the independence of the causal evidence. To strengthen this link, we will incorporate new ablations in the revised §5: (i) testing steering vectors on novel moral scenarios and dilemmas not present in the original probes, and (ii) adding non-moral control steering vectors (e.g., from factual or sentiment tasks) to demonstrate specificity of foundation-relevant shifts. These additions will be supported by quantitative metrics comparing effect sizes. This addresses the core concern without requiring a full redesign of the pipeline. revision: partial
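The shuffled-label ablation promised in the first response could take the form of a permutation test: recompute the separation statistic after randomly reassigning labels and see how often chance matches the observed value. The sketch below is one hedged reading of that control, with a hypothetical separation statistic (norm of the class mean difference) and synthetic activations; the rebuttal does not specify the procedure.

```python
import numpy as np

def class_gap(acts, labels):
    """Norm of the difference between class means (hypothetical separation statistic)."""
    diff = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return float(np.linalg.norm(diff))

def shuffled_label_test(acts, labels, n_shuffles=500, seed=0):
    """Compare the observed gap with gaps obtained under permuted labels."""
    rng = np.random.default_rng(seed)
    observed = class_gap(acts, labels)
    null = np.array(
        [class_gap(acts, rng.permutation(labels)) for _ in range(n_shuffles)]
    )
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (n_shuffles + 1)
    return observed, null, float(p_value)

# Synthetic activations: class 1 is shifted along one coordinate.
rng = np.random.default_rng(4)
acts = rng.normal(size=(200, 16))
labels = np.repeat([0, 1], 100)
acts[labels == 1, 0] += 1.5

observed, null, p_value = shuffled_label_test(acts, labels)
print("observed gap:", observed, "null mean:", null.mean(), "p-value:", p_value)
```

If a foundation's separation were no larger than what shuffled labels produce, the corresponding direction would be hard to attribute to the foundation rather than to generic activation structure.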

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical measurements

full rationale

The paper's derivation chain consists of layer-wise activation analysis, pretrained SAE feature extraction, and causal steering interventions, all evaluated against external human judgment benchmarks. These steps produce measurable outputs (alignment scores, semantic links, behavioral shifts) that are not equivalent to the input stimuli or MFT categories by construction. MFT serves as an external analytic lens drawn from established psychological literature rather than a self-defined or author-derived ansatz. No equations reduce fitted parameters to predictions, no self-citation chains bear the central claims, and no uniqueness theorems or renamings are invoked. The results remain falsifiable via alternative taxonomies or unsupervised methods, keeping the analysis self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work relies on the validity of Moral Foundations Theory as a psychologically grounded taxonomy and on the assumption that pretrained SAEs recover semantically meaningful features; no new entities are postulated and no free parameters are explicitly fitted beyond standard training choices.

axioms (2)

domain assumption: Moral Foundations Theory provides a valid, cross-culturally applicable decomposition of human moral concepts. Invoked throughout as the analytic framework for labeling representations and behaviors.
domain assumption: Pretrained sparse autoencoders on residual streams extract interpretable, disentangled features corresponding to semantic concepts. Used to identify sparse features linked to specific moral foundations.

pith-pipeline@v0.9.0 · 5598 in / 1434 out tokens · 52375 ms · 2026-05-16T16:59:48.755039+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded... layer-wise analysis of MFT concept representations... pretrained sparse autoencoders (SAEs)... causal steering interventions"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we demonstrate a robust representational alignment between LLM latent spaces and human moral perceptions... geometric separability provides computational support for pluralist theories"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Psychological Steering of Large Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.
