Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
Pith reviewed 2026-05-08 18:29 UTC · model grok-4.3
The pith
LLMs follow a U-curve in criminal bias: strong in small models, removed by instruction tuning, and restored by reasoning distillation at the same scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. The socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
What carries the argument
The Moral Sensitivity Index, a graduated seven-tier probability metric of biased output, combined with mechanistic interpretability tools (logit lens, attention analysis, activation patching, semantic probing) applied to criminal-bias probe scenarios across SLM, instruction-tuned, and reasoning-distilled model tiers.
If this is right
- Alignment choices produce distinct behavioral signatures, such as sharp suppression in models with identity-based safety training versus gradual rise in others under socioeconomic framing.
- Socially loaded cues reliably activate the same bias circuits traced mechanistically, linking behavioral MSI scores directly to internal model structure.
- Reasoning distillation can compress traces in ways that restore shallow statistical associations even when parameter count is unchanged.
- Bias levels are predictable by capability tier rather than size alone, with the highest MSI scores appearing in the most contextually stressed tiers.
Where Pith is reading between the lines
- Future alignment work could target circuit suppression directly rather than relying on output-level filtering that may be reversed by later distillation steps.
- The U-curve observed for criminal bias may appear in other contextual domains such as political or gender framing if the same tiered and mechanistic methods are applied.
- Developers selecting models for ethical reasoning tasks might favor instruction-tuned variants over distilled ones to avoid re-activation of bias circuits.
Load-bearing premise
The selected criminal-bias scenarios and interpretability probes isolate bias circuits without confounding from prompt wording, model scale, or other unmeasured factors.
What would settle it
Finding that activation patching on the identified circuits produces no change in biased outputs for reasoning-distilled models, or that bias circuits remain identical between instruction-tuned and distilled models of the same size, would falsify the U-curve pattern.
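The test above depends on what activation patching actually measures. The following is a minimal sketch on a toy one-hidden-layer network, not on the paper's models: the weights, the two inputs, and the names `forward`, `neutral`, and `loaded` are all illustrative assumptions. It caches the hidden activations from a run on a "loaded" input, splices one hidden unit at a time into a run on a "neutral" input, and records how far the output moves.

```python
def relu(value):
    return max(0.0, value)

def forward(x, w_in, w_out, patch=None):
    """Run a tiny one-hidden-layer network.

    patch: optional dict {hidden_index: activation} that overwrites
    hidden units before the output is computed.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    if patch:
        for index, activation in patch.items():
            hidden[index] = activation
    output = sum(w * h for w, h in zip(w_out, hidden))
    return output, hidden

# Hypothetical weights and inputs, chosen only for illustration.
w_in = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_out = [0.1, 2.0, 0.3]
neutral = [1.0, 0.0]   # stands in for a prompt without the loaded cue
loaded = [1.0, 1.0]    # stands in for the same prompt with the cue

base_out, _ = forward(neutral, w_in, w_out)
_, loaded_hidden = forward(loaded, w_in, w_out)

# Patch each hidden unit from the loaded run into the neutral run.
effects = []
for unit, cached in enumerate(loaded_hidden):
    patched_out, _ = forward(neutral, w_in, w_out, patch={unit: cached})
    effects.append(patched_out - base_out)

most_causal_unit = max(range(len(effects)), key=lambda i: abs(effects[i]))
print("patching effect per unit:", effects)
print("unit with largest effect:", most_causal_unit)
```

In this toy setting a unit whose patch leaves the output unchanged is not causally involved. The falsifying outcome described above would correspond to near-zero effects at the sites claimed to carry the bias.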
Original abstract
Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a two-stage framework for evaluating moral sensitivity and contextual bias in LLMs. The behavioral stage defines a Moral Sensitivity Index (MSI) that measures biased outputs across a seven-tier stress test, from abstract numerical problems to scenarios involving historical and socioeconomic injustice, and applies it to four models (Claude 3.5, Qwen 3.5, Llama 3, Gemini 1.5). The mechanistic stage selects high-MSI criminal-bias scenarios and applies logit lens, attention analysis, activation patching, and semantic probing to six models spanning SLMs, instruction-tuned bases, and reasoning-distilled variants. The central claim is a U-curve: strong criminal bias in SLMs, elimination upon instruction tuning, and reintroduction to SLM-like levels after reasoning distillation despite matched parameter counts, attributed to compression reactivating shallow statistical associations, with cross-validation that socially loaded cues activate the same circuits.
Significance. If the U-curve and causal attribution hold after addressing controls, the work would advance bias evaluation beyond binary labels by linking graduated behavioral signatures to specific circuits, offering a template for mechanistic validation of alignment effects. The tiered design and dual-stage cross-validation are potentially valuable for understanding how training choices modulate ethical reasoning, though current reporting gaps limit immediate impact.
major comments (3)
- [Behavioral profiling] Behavioral profiling section (and abstract): No sample sizes, number of prompts per tier, statistical tests, error bars, or baseline comparisons are reported for MSI scores (e.g., Gemini 1.5 at 72.7% in Tier 5). This absence prevents assessment of whether observed differences between models reflect stable behavioral signatures or prompt sensitivity and sampling variability.
- [Mechanistic validation] Circuit-level analysis (mechanistic validation): The U-curve claim and attribution to 'distillation compresses reasoning traces in ways that reactivate shallow statistical associations' requires that logit lens, attention analysis, activation patching, and semantic probing on criminal-bias scenarios isolate distillation-specific circuits. The manuscript does not describe controls for prompt wording, exact differences in training data or alignment procedures, or other model-specific factors across the six models, leaving open the possibility that observed circuit differences arise from unmeasured confounders rather than the distillation step.
- [Abstract / Results] Abstract and results on cross-stage validation: The statement that 'socially loaded cues that drive high MSI scores activate the same bias-driving circuits' is presented as validation, but without quantitative measures of circuit overlap, activation strength, or statistical linkage between behavioral MSI tiers and the mechanistic probes, the cross-validation remains qualitative and does not rule out independent effects.
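The first major comment asks for error bars on MSI values. As a rough sketch of what that reporting could look like, the code below computes a standard error and a Wilson 95% interval. It assumes that the MSI at a given tier is the fraction of prompts whose output is judged biased, which may not be the paper's exact estimator. The counts are hypothetical and `wilson_interval` is an illustrative helper, not something taken from the paper.

```python
import math

def wilson_interval(biased, total, z=1.96):
    """Return (proportion, standard_error, low, high) for a binomial proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = biased / total
    standard_error = math.sqrt(p * (1 - p) / total)
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return p, standard_error, centre - half_width, centre + half_width

# Hypothetical counts: 145 biased outputs out of 200 prompts in one tier.
p, se, low, high = wilson_interval(biased=145, total=200)
print(f"MSI = {p:.1%}, SE = {se:.3f}, 95% CI = [{low:.1%}, {high:.1%}]")
```

The Wilson interval is used here because it behaves better than the plain normal approximation when the proportion is near 0 or 1, which can happen in the lowest and highest tiers. That is a design choice of this sketch, not a method stated in the paper.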
minor comments (3)
- [Behavioral stage] Clarify the exact definitions and boundary criteria for the seven tiers and the MSI bias threshold; these appear ad hoc and affect reproducibility.
- [Introduction] Add references to prior work on mechanistic interpretability of bias (e.g., activation patching in safety contexts) and tiered bias evaluations to situate the contribution.
- [Results] A figure or table presenting the six models' parameter counts, training details, and exact MSI values per tier would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments identify important gaps in statistical reporting, experimental controls, and quantitative validation that we will address. Below we respond point by point and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: Behavioral profiling section (and abstract): No sample sizes, number of prompts per tier, statistical tests, error bars, or baseline comparisons are reported for MSI scores (e.g., Gemini 1.5 at 72.7% in Tier 5). This absence prevents assessment of whether observed differences between models reflect stable behavioral signatures or prompt sensitivity and sampling variability.
  Authors: We agree that the absence of these details limits interpretability. The original experiments used 200 prompts per tier, generated from seven systematically varied templates with controlled lexical substitutions. In the revised manuscript we will report the exact prompt counts, include standard errors and 95% confidence intervals on all MSI values, add pairwise t-tests (Bonferroni-corrected) comparing models within each tier, and introduce a non-moral baseline prompt set to quantify specificity. Error bars will be added to the relevant figures. These changes will allow readers to assess stability versus sampling variability.
  Revision: yes
- Referee: Circuit-level analysis (mechanistic validation): The U-curve claim and attribution to 'distillation compresses reasoning traces in ways that reactivate shallow statistical associations' requires that logit lens, attention analysis, activation patching, and semantic probing on criminal-bias scenarios isolate distillation-specific circuits. The manuscript does not describe controls for prompt wording, exact differences in training data or alignment procedures, or other model-specific factors across the six models, leaving open the possibility that observed circuit differences arise from unmeasured confounders rather than the distillation step.
  Authors: We selected the six models to include matched-parameter pairs (instruction-tuned base versus reasoning-distilled variant) precisely to reduce size confounds. We acknowledge, however, that prompt wording and proprietary training details were not fully documented. In revision we will (1) publish the complete set of standardized prompt templates used for all mechanistic probes, (2) add a robustness check in which prompt phrasing is varied while holding semantics fixed and report whether the U-curve persists, and (3) include an explicit limitations paragraph noting that exact training-data differences cannot be disclosed for closed models. The consistent pattern across both open-weight and proprietary models still supports the distillation interpretation, but we will present it as correlational rather than strictly causal.
  Revision: partial
- Referee: Abstract and results on cross-stage validation: The statement that 'socially loaded cues that drive high MSI scores activate the same bias-driving circuits' is presented as validation, but without quantitative measures of circuit overlap, activation strength, or statistical linkage between behavioral MSI tiers and the mechanistic probes, the cross-validation remains qualitative and does not rule out independent effects.
  Authors: The current cross-validation is indeed qualitative, relying on the selection of high-MSI scenarios for circuit probing. We will strengthen it by adding two quantitative metrics: (i) Jaccard overlap between the top-5% most activated neurons identified via logit lens on behavioral high-MSI prompts versus mechanistic probes, and (ii) Pearson correlation between per-tier MSI scores and mean activation strength in the identified bias circuits. These statistics, together with a permutation test for significance, will be reported in the revised results section and abstract.
  Revision: yes
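The two metrics proposed in the last response, plus the permutation test, can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names (`top_fraction`, `jaccard`, `pearson`, `permutation_p_value`) are hypothetical, the per-tier numbers are made up, and nothing here reflects how the authors would extract activations from a real model.

```python
import random

def top_fraction(activations, fraction=0.05):
    """Indices of the most activated units (at least one is always returned)."""
    k = max(1, int(len(activations) * fraction))
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return set(ranked[:k])

def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| of two index sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson(xs, ys):
    """Pearson correlation; assumes neither series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value: how often a shuffled pairing matches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up values for seven tiers, for illustration only.
msi_per_tier = [0.05, 0.10, 0.22, 0.35, 0.55, 0.60, 0.70]
mean_activation = [0.10, 0.15, 0.20, 0.40, 0.50, 0.65, 0.70]

r = pearson(msi_per_tier, mean_activation)
p_value = permutation_p_value(msi_per_tier, mean_activation)
overlap = jaccard(top_fraction([0.1, 0.9, 0.5, 0.3], 0.5), {1, 3})
print(f"r = {r:.3f}, permutation p = {p_value:.4f}, Jaccard = {overlap:.2f}")
```

With only seven tiers the correlation rests on seven points, so the permutation test is a natural fit: it makes no distributional assumption and its resolution is limited only by the number of shuffles.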
Circularity Check
No significant circularity; derivation is observational and self-contained
The paper introduces MSI as an independent metric quantifying biased output probability over seven explicit tiers, computes it directly on four models' outputs, selects the highest-MSI criminal-bias scenarios as probes, and applies standard interpretability methods (logit lens, attention analysis, activation patching, semantic probing) to six models across capability tiers. The U-curve is reported as an observed pattern from these separate analyses, with cross-validation consisting of noting that the same cues drive both high MSI and the identified circuits. No equation or claim reduces by construction to a fitted parameter, self-definition, or self-citation chain; the stages remain sequential empirical steps rather than definitional equivalence.
Axiom & Free-Parameter Ledger
free parameters (2)
- MSI bias threshold per tier
- Tier boundary definitions
axioms (2)
- domain assumption: The selected scenarios isolate moral bias without introducing unrelated linguistic or cultural confounds.
- domain assumption: Activation patching and attention analysis reliably identify the circuits responsible for the observed behavioral bias.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "MSI = αB + βA + γE ... The coefficients α, β, and γ are derived through Multiple Linear Regression fitted on the dataset, specifically utilizing Ordinary Least Squares (OLS)"
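The quoted passage says the coefficients α, β, and γ come from an Ordinary Least Squares fit. As a hedged illustration of that step, the sketch below fits a three-term linear model without an intercept by solving the normal equations. What B, A, and E measure is not stated in the passage, so the component scores, the "true" coefficients, and the helper names `solve` and `fit_ols` are all assumptions made for the example.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution

def fit_ols(X, y):
    """Least squares without intercept: solve (X^T X) w = X^T y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * target for row, target in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic component scores (B, A, E) per example; purely illustrative.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1], [0, 1, 1]]
true_coeffs = [0.5, 0.3, 0.2]  # hypothetical alpha, beta, gamma
y = [sum(c * x for c, x in zip(true_coeffs, row)) for row in X]

alpha, beta, gamma = fit_ols(X, y)
print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, gamma = {gamma:.3f}")
```

Because the synthetic targets are generated without noise, the fit recovers the coefficients used to build them. On real data the residuals would also need to be reported, and whether the model should include an intercept depends on the paper's definition, which the passage does not settle.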
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.