Recognition: 2 theorem links · Lean Theorem
Tracing Moral Foundations in Large Language Models
Pith reviewed 2026-05-16 16:59 UTC · model grok-4.3
The pith
Large language models develop internal representations of moral foundations that align with human judgments and emerge naturally during pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models represent and distinguish moral foundations in a manner that aligns with human judgments, and this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs.
What carries the argument
Moral Foundations Theory framework applied through layer-wise representation analysis, pretrained sparse autoencoders on the residual stream, and causal steering interventions with both dense vectors and sparse features.
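To make the dense-steering step concrete, here is a minimal sketch in plain Python. It assumes a steering direction is built as the difference between the mean activation on foundation-related inputs and the mean activation on neutral inputs, then added to a hidden state with a scale `alpha`. This page does not say how the paper constructs its dense MFT vectors, so the mean-difference recipe, the function names, and the toy numbers are all illustrative assumptions.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(foundation_acts: List[Vector], neutral_acts: List[Vector]) -> Vector:
    """Assumed recipe: mean(foundation activations) - mean(neutral activations)."""
    mu_f = mean_vector(foundation_acts)
    mu_n = mean_vector(neutral_acts)
    return [a - b for a, b in zip(mu_f, mu_n)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add the scaled direction to one residual-stream vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional "activations" (hypothetical values, not model data).
care_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
neutral_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

v_care = steering_vector(care_acts, neutral_acts)
steered = steer([0.5, 0.5, 0.5], v_care, alpha=2.0)
```

In a real model the same addition would be applied at a chosen layer during the forward pass. The sketch only shows the arithmetic.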
If this is right
- Moral concepts appear distributed across layers and partly disentangled within shared representations.
- The moral geometry forms as a latent pattern from statistical regularities in language data alone.
- Post-training selectively rewires existing moral representations rather than building them from scratch.
- Causal links exist between identified internal features and observable moral behavior in model outputs.
Where Pith is reading between the lines
- Editing moral outputs could be achieved by targeted vector or feature steering instead of full retraining.
- Similar latent structures might appear for other value systems or ethical frameworks not covered by Moral Foundations Theory.
- Training data composition could directly shape which moral distinctions models prioritize, creating measurable biases.
- Cross-model comparisons might reveal whether moral geometry scales consistently with parameter count or architecture.
Load-bearing premise
Moral Foundations Theory categories and SAE features accurately reflect the models' genuine internal moral concepts rather than matching surface patterns or imposed external labels.
What would settle it
Steering interventions using the identified dense vectors or sparse SAE features fail to produce consistent, predictable shifts in the models' moral judgments or foundation-relevant outputs.
Original abstract
Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial "moral mimicry." Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that models represent and distinguish moral foundations in a manner that aligns with human judgments, and that this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs represent and distinguish the moral foundations of Moral Foundations Theory (MFT) in a manner that aligns with human judgments. This moral geometry emerges naturally from pretraining, is selectively rewired by post-training, and is supported by sparse SAE features with semantic links to specific foundations. Causal steering via dense MFT vectors or sparse SAE features produces predictable shifts in foundation-relevant behavior. The authors present this as mechanistic evidence that moral concepts are distributed, layered, and partly disentangled, based on 14 models from four families (Llama, Qwen2.5, Qwen3-MoE, Mistral).
Significance. If the results hold, the work supplies mechanistic evidence that moral concepts in LLMs arise as latent patterns from language statistics rather than pure mimicry. The multi-level pipeline (layer-wise analysis, pretrained SAEs, and causal steering) strengthens interpretability claims by linking representations to behavior. This advances understanding of how pluralistic moral structures can emerge in models and has implications for alignment research.
major comments (2)
- [Methods and §3] Methods and §3 (Layer-wise Analysis): The central claim that moral geometry 'naturally emerges from pretraining' and aligns with human judgments rests on MFT-defined stimuli and post-hoc feature labeling. Without controls using alternative taxonomies or fully unsupervised discovery of moral directions, the alignment metrics and semantic links risk being driven by the analytic framework rather than intrinsic residual-stream structure. This is load-bearing for distinguishing faithful capture from projection.
- [§5] §5 (Steering Interventions): Steering along dense vectors or sparse SAE features produces behavioral shifts, but the evaluation uses the same MFT probes as the representation analysis. No ablation isolates whether the vectors/features encode the claimed foundations independently of those probes, weakening the causal connection between internal representations and moral outputs.
minor comments (2)
- [Figure 2] Figure 2: Add statistical significance markers or confidence intervals to the layer-wise alignment plots to allow readers to assess whether reported correlations exceed chance levels.
- [§2] The term 'moral geometry' is used throughout without a precise operational definition (e.g., in terms of cosine distances or activation subspaces); a short formalization in §2 would improve clarity.
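One way the formalization requested in the last minor comment could look, sketched in plain Python: treat "moral geometry" as the matrix of pairwise cosine distances between per-foundation centroid vectors at a given layer. This is only an assumed operationalization suggested by the referee's wording ("cosine distances"). The centroid values below are hypothetical, and the paper may define the term differently.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance_matrix(centroids: Dict[str, Vector]) -> Dict[Tuple[str, str], float]:
    """Pairwise cosine distance (1 - cosine similarity) between named centroids."""
    names = sorted(centroids)
    return {
        (a, b): 1.0 - cosine_similarity(centroids[a], centroids[b])
        for a in names
        for b in names
    }


# Hypothetical 2-dimensional centroids, one per foundation.
centroids = {
    "care": [1.0, 0.0],
    "fairness": [0.0, 1.0],
    "loyalty": [1.0, 1.0],
}
geometry = cosine_distance_matrix(centroids)
```

With such a matrix per layer, alignment with human judgments could then be scored by correlating its entries with a human-derived dissimilarity matrix. That second step is not shown here.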
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important considerations for strengthening our claims about moral representations in LLMs. We address each major comment point by point below.
-
Referee: [Methods and §3] Methods and §3 (Layer-wise Analysis): The central claim that moral geometry 'naturally emerges from pretraining' and aligns with human judgments rests on MFT-defined stimuli and post-hoc feature labeling. Without controls using alternative taxonomies or fully unsupervised discovery of moral directions, the alignment metrics and semantic links risk being driven by the analytic framework rather than intrinsic residual-stream structure. This is load-bearing for distinguishing faithful capture from projection.
Authors: We agree that grounding the analysis in MFT stimuli creates a risk of framework-driven results rather than purely intrinsic structure. The alignment metrics rely on external human MFT survey data for validation, and emergence from pretraining is supported by comparative layer-wise patterns between base and post-trained models. To address the concern directly, we will add a dedicated limitations subsection in §3 discussing the choice of MFT versus alternative taxonomies (e.g., Schwartz values) and include new controls such as random direction baselines and shuffled-label ablations to test specificity. Fully unsupervised discovery of moral directions would require substantial new experiments beyond the current scope, but the added controls and discussion will mitigate projection risks. This is a partial revision. revision: partial
-
Referee: [§5] §5 (Steering Interventions): Steering along dense vectors or sparse SAE features produces behavioral shifts, but the evaluation uses the same MFT probes as the representation analysis. No ablation isolates whether the vectors/features encode the claimed foundations independently of those probes, weakening the causal connection between internal representations and moral outputs.
Authors: We acknowledge that reusing the same MFT probes for both representation analysis and steering evaluation limits the independence of the causal evidence. To strengthen this link, we will incorporate new ablations in the revised §5: (i) testing steering vectors on novel moral scenarios and dilemmas not present in the original probes, and (ii) adding non-moral control steering vectors (e.g., from factual or sentiment tasks) to demonstrate specificity of foundation-relevant shifts. These additions will be supported by quantitative metrics comparing effect sizes. This addresses the core concern without requiring a full redesign of the pipeline and constitutes a partial revision. revision: partial
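The controls promised in the two responses above (shuffled-label ablations, non-moral control vectors, and effect-size comparisons) can be sketched with the Python standard library. The example assumes each condition yields one scalar foundation-relevance score per held-out scenario. The scores, the choice of Cohen's d, and the one-sided permutation test are illustrative assumptions rather than the authors' stated protocol.

```python
import math
import random
import statistics
from typing import List, Sequence


def cohens_d(treated: Sequence[float], baseline: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treated), len(baseline)
    pooled_var = (
        (n1 - 1) * statistics.variance(treated)
        + (n2 - 1) * statistics.variance(baseline)
    ) / (n1 + n2 - 2)
    return (statistics.mean(treated) - statistics.mean(baseline)) / math.sqrt(pooled_var)


def shuffled_label_p_value(
    treated: Sequence[float],
    baseline: Sequence[float],
    n_shuffles: int = 2000,
    seed: int = 0,
) -> float:
    """One-sided permutation test: fraction of random relabelings of the pooled
    scores whose mean gap is at least as large as the observed gap."""
    rng = random.Random(seed)
    observed = statistics.mean(treated) - statistics.mean(baseline)
    pooled: List[float] = list(treated) + list(baseline)
    k = len(treated)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        gap = statistics.mean(pooled[:k]) - statistics.mean(pooled[k:])
        if gap >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Hypothetical scores on held-out scenarios (not data from the paper).
baseline = [0.0, 1.0, 2.0, 1.0, 1.0]          # no steering
moral_steered = [3.0, 4.0, 5.0, 4.0, 4.0]     # steering with a foundation vector
control_steered = [1.0, 1.0, 2.0, 0.0, 1.0]   # steering with a non-moral vector

d_moral = cohens_d(moral_steered, baseline)
d_control = cohens_d(control_steered, baseline)
p_moral = shuffled_label_p_value(moral_steered, baseline)
p_control = shuffled_label_p_value(control_steered, baseline)
```

A specific foundation effect would show up as a large `d_moral` with a small `p_moral`, alongside a `d_control` near zero. If the control vector moved the scores just as much, the shift would not be specific to the claimed foundation.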
Circularity Check
No significant circularity; claims rest on independent empirical measurements
The paper's derivation chain consists of layer-wise activation analysis, pretrained SAE feature extraction, and causal steering interventions, all evaluated against external human judgment benchmarks. These steps produce measurable outputs (alignment scores, semantic links, behavioral shifts) that are not equivalent to the input stimuli or MFT categories by construction. MFT serves as an external analytic lens drawn from established psychological literature rather than a self-defined or author-derived ansatz. No equations reduce fitted parameters to predictions, no self-citation chains bear the central claims, and no uniqueness theorems or renamings are invoked. The results remain falsifiable via alternative taxonomies or unsupervised methods, keeping the analysis self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Moral Foundations Theory provides a valid, cross-culturally applicable decomposition of human moral concepts.
- domain assumption: Pretrained sparse autoencoders on residual streams extract interpretable, disentangled features corresponding to semantic concepts.
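For readers unfamiliar with the second assumption, the following plain-Python sketch shows the forward pass of a basic ReLU sparse autoencoder and how a single feature's decoder direction could be used for sparse steering. The weights are toy values chosen by hand, and the pretrained SAEs used in the paper may use a different architecture or activation. Treat every name and number here as hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations: ReLU(W_enc @ x + b_enc), one row of W_enc per feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def sae_decode(features: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruction: b_dec + sum_i features[i] * w_dec[i] (one row per feature)."""
    out = list(b_dec)
    for f, row in zip(features, w_dec):
        for j, w in enumerate(row):
            out[j] += f * w
    return out


def steer_with_feature(x: Vector, w_dec: Matrix, index: int, alpha: float) -> Vector:
    """Sparse steering: add alpha times one feature's decoder direction to x."""
    return [xi + alpha * w for xi, w in zip(x, w_dec[index])]


# Toy residual-stream vector and hand-picked weights (3 features, 2 dimensions).
x = [1.0, -1.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -0.5]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

features = sae_encode(x, w_enc, b_enc)       # only the first feature fires
reconstruction = sae_decode(features, w_dec, b_dec)
steered = steer_with_feature(x, w_dec, index=1, alpha=2.0)
```

The interpretability assumption in the ledger is about whether individual entries of `features` track a single human-readable concept. The sketch shows only the mechanics, not evidence for that claim.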
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded... layer-wise analysis of MFT concept representations... pretrained sparse autoencoders (SAEs)... causal steering interventions
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
we demonstrate a robust representational alignment between LLM latent spaces and human moral perceptions... geometric separability provides computational support for pluralist theories
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Psychological Steering of Large Language Models
Mean-difference residual stream injections outperform personality prompting for OCEAN trait steering in most LLMs, with hybrids performing best and showing approximate linearity but non-human trait covariances.