{"paper":{"title":"Tracing Moral Foundations in Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Large language models develop internal representations of moral foundations that align with human judgments and emerge naturally during pretraining.","cross_cats":["cs.AI"],"primary_cat":"cs.CL","authors_text":"Bowen Yi, Chenxiao Yu, Farzan Karimi-Malekabadi, Jinyi Ye, Morteza Dehghani, Shrikanth Narayanan, Suhaib Abdurahman, Yue Zhao","submitted_at":"2026-01-09T00:09:28Z","abstract_excerpt":"Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial ``moral mimicry.'' Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretraine"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Models represent and distinguish moral foundations in a manner that aligns with human judgments, and this moral geometry naturally emerges from pretraining and is selectively rewired by post-training; steering along dense vectors or sparse SAE features produces predictable shifts in foundation-relevant behavior.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the chosen Moral Foundations Theory categories and the SAE feature extraction faithfully capture the models' internal moral concepts rather than imposing an external taxonomy or detecting spurious correlations.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Moral foundations in LLMs form distributed, layered representations that align with human perceptions, emerge from pretraining, and causally influence outputs when steered via dense vectors or sparse features.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Large language models develop internal representations of moral foundations that align with human judgments and emerge naturally during pretraining.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b64b2bfc9bedb53cfe0a377193f17ac33089468fb77d279909fb0fe5cad252e0"},"source":{"id":"2601.05437","kind":"arxiv","version":3},"verdict":{"id":"b05af869-480c-4c50-9d0e-3898f96bbcaa","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T16:59:48.755039Z","strongest_claim":"Models represent and distinguish moral foundations in a manner that aligns with human judgments, and this moral geometry naturally emerges from pretraining and is selectively rewired by post-training; steering along dense vectors or sparse SAE features produces predictable shifts in foundation-relevant behavior.","one_line_summary":"Moral foundations in LLMs form distributed, layered representations that align with human perceptions, emerge from pretraining, and causally influence outputs when steered via dense vectors or sparse features.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the chosen Moral Foundations Theory categories and the SAE feature extraction faithfully capture the models' internal moral concepts rather than imposing an external taxonomy or detecting spurious correlations.","pith_extraction_headline":"Large language models develop internal representations of moral foundations that align with human judgments and emerge naturally during pretraining."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2601.05437/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"7c493822b549d0aa63e919326c21f4034b258aa9969bdaf706a14b8f38994438"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}