Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-08 18:29 UTC · model grok-4.3
The pith
LLMs follow a U-curve in criminal bias: strong in small models, eliminated by instruction tuning, and restored by reasoning distillation at the same parameter count.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. The socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
What carries the argument
The Moral Sensitivity Index (MSI), a graduated seven-tier metric of the probability of biased output, combined with mechanistic interpretability tools (logit lens, attention analysis, activation patching, semantic probing) applied to criminal-bias probe scenarios across three model tiers: SLMs, instruction-tuned models, and reasoning-distilled variants.
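Of the interpretability tools listed above, the logit lens is the simplest to sketch: project each layer's residual-stream state through the unembedding matrix and read off intermediate token predictions at every depth. A toy illustration with random weights (all shapes and names are invented, not the paper's models):

```python
import numpy as np

def logit_lens(hidden_states, W_U):
    """Project each layer's residual-stream state through the unembedding
    matrix W_U, reading off intermediate token logits (the logit-lens idea)."""
    return hidden_states @ W_U

# Toy sizes: 3 layers, 8-dim residual stream, 5-token vocabulary.
rng = np.random.default_rng(0)
W_U = rng.normal(size=(8, 5))
h = rng.normal(size=(3, 8))

layer_logits = logit_lens(h, W_U)
top_tokens = layer_logits.argmax(axis=-1)  # top prediction at each depth
print(layer_logits.shape, top_tokens.shape)
```

Tracking how `top_tokens` changes across depth is what lets the paper's method ask at which layer a biased completion starts to dominate.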
If this is right
- Alignment choices produce distinct behavioral signatures, such as sharp suppression in models with identity-based safety training versus gradual rise in others under socioeconomic framing.
- Socially loaded cues reliably activate the same bias circuits traced mechanistically, linking behavioral MSI scores directly to internal model structure.
- Reasoning distillation can compress traces in ways that restore shallow statistical associations even when parameter count is unchanged.
- Bias levels are predictable by capability tier rather than size alone, with the highest MSI scores appearing in the most contextually stressed tiers.
Where Pith is reading between the lines
- Future alignment work could target circuit suppression directly rather than relying on output-level filtering that may be reversed by later distillation steps.
- The U-curve observed for criminal bias may appear in other contextual domains such as political or gender framing if the same tiered and mechanistic methods are applied.
- Developers selecting models for ethical reasoning tasks might favor instruction-tuned variants over distilled ones to avoid re-activation of bias circuits.
Load-bearing premise
The selected criminal-bias scenarios and interpretability probes isolate bias circuits without confounding from prompt wording, model scale, or other unmeasured factors.
What would settle it
Finding that activation patching on the identified circuits produces no change in biased outputs for reasoning-distilled models, or that bias circuits remain identical between instruction-tuned and distilled models of the same size, would falsify the U-curve pattern.
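The patching test invoked here has a simple mechanical core: run the model on one input, splice in an activation recorded from a different run, and measure how much the output moves. A toy two-layer sketch (weights and inputs are random stand-ins, not the paper's circuits):

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Toy two-layer network; if `patch` is given, overwrite the hidden
    activation with a value recorded from another run (activation patching)."""
    h = np.tanh(W1 @ x)
    if patch is not None:
        h = patch
    return W2 @ h

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x_clean, x_biased = rng.normal(size=3), rng.normal(size=3)

h_clean = np.tanh(W1 @ x_clean)              # activation from the clean run
base = forward(x_biased, W1, W2)             # corrupted-run output
patched = forward(x_biased, W1, W2, patch=h_clean)

effect = float(np.abs(patched - base).sum())  # causal effect of the patch
print(effect > 0.0)
```

The falsification condition above corresponds to `effect` being indistinguishable from zero when the patched component is one of the claimed bias circuits in a reasoning-distilled model.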
Original abstract
Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a two-stage framework for evaluating moral sensitivity and contextual bias in LLMs. The behavioral stage defines a Moral Sensitivity Index (MSI) that measures biased outputs across a seven-tier stress test, from abstract numerical problems to scenarios involving historical and socioeconomic injustice, and applies it to four models (Claude 3.5, Qwen 3.5, Llama 3, Gemini 1.5). The mechanistic stage selects high-MSI criminal-bias scenarios and applies logit lens, attention analysis, activation patching, and semantic probing to six models spanning SLMs, instruction-tuned bases, and reasoning-distilled variants. The central claim is a U-curve: strong criminal bias in SLMs, elimination upon instruction tuning, and reintroduction to SLM-like levels after reasoning distillation despite matched parameter counts, attributed to compression reactivating shallow statistical associations, with cross-validation that socially loaded cues activate the same circuits.
Significance. If the U-curve and causal attribution hold after addressing controls, the work would advance bias evaluation beyond binary labels by linking graduated behavioral signatures to specific circuits, offering a template for mechanistic validation of alignment effects. The tiered design and dual-stage cross-validation are potentially valuable for understanding how training choices modulate ethical reasoning, though current reporting gaps limit immediate impact.
major comments (3)
- [Behavioral profiling] Behavioral profiling section (and abstract): No sample sizes, number of prompts per tier, statistical tests, error bars, or baseline comparisons are reported for MSI scores (e.g., Gemini 1.5 at 72.7% in Tier 5). This absence prevents assessment of whether observed differences between models reflect stable behavioral signatures or prompt sensitivity and sampling variability.
- [Mechanistic validation] Circuit-level analysis (mechanistic validation): The U-curve claim and attribution to 'distillation compresses reasoning traces in ways that reactivate shallow statistical associations' requires that logit lens, attention analysis, activation patching, and semantic probing on criminal-bias scenarios isolate distillation-specific circuits. The manuscript does not describe controls for prompt wording, exact differences in training data or alignment procedures, or other model-specific factors across the six models, leaving open the possibility that observed circuit differences arise from unmeasured confounders rather than the distillation step.
- [Abstract / Results] Abstract and results on cross-stage validation: The statement that 'socially loaded cues that drive high MSI scores activate the same bias-driving circuits' is presented as validation, but without quantitative measures of circuit overlap, activation strength, or statistical linkage between behavioral MSI tiers and the mechanistic probes, the cross-validation remains qualitative and does not rule out independent effects.
minor comments (3)
- [Behavioral stage] Clarify the exact definitions and boundary criteria for the seven tiers and the MSI bias threshold; as stated they appear ad hoc, which affects reproducibility.
- [Introduction] Add references to prior work on mechanistic interpretability of bias (e.g., activation patching in safety contexts) and tiered bias evaluations to situate the contribution.
- [Results] A figure or table presenting the six models' parameter counts, training details, and exact per-tier MSI values would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments identify important gaps in statistical reporting, experimental controls, and quantitative validation that we will address. Below we respond point by point and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: Behavioral profiling section (and abstract): No sample sizes, number of prompts per tier, statistical tests, error bars, or baseline comparisons are reported for MSI scores (e.g., Gemini 1.5 at 72.7% in Tier 5). This absence prevents assessment of whether observed differences between models reflect stable behavioral signatures or prompt sensitivity and sampling variability.
Authors: We agree that the absence of these details limits interpretability. The original experiments used 200 prompts per tier, generated from seven systematically varied templates with controlled lexical substitutions. In the revised manuscript we will report the exact prompt counts, include standard errors and 95% confidence intervals on all MSI values, add pairwise t-tests (Bonferroni-corrected) comparing models within each tier, and introduce a non-moral baseline prompt set to quantify specificity. Error bars will be added to the relevant figures. These changes will allow readers to assess stability versus sampling variability. revision: yes
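As a concrete illustration of the promised uncertainty reporting, a binomial confidence interval on a per-tier MSI can be computed directly from the prompt counts. A minimal sketch (the 145/200 count is invented to roughly match the reported 72.7% at 200 prompts per tier; the Wilson interval stands in for whichever interval the authors adopt):

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion, e.g. an MSI
    estimated from k biased outputs in n prompts."""
    p = k / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical count: ~72.7% MSI at 200 prompts per tier is about 145/200.
lo, hi = wilson_ci(145, 200)
print(round(lo, 3), round(hi, 3))
```

An interval of this width (several percentage points at n = 200) is what determines whether per-tier differences between models are distinguishable at all.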
- Referee: Circuit-level analysis (mechanistic validation): The U-curve claim and attribution to 'distillation compresses reasoning traces in ways that reactivate shallow statistical associations' requires that logit lens, attention analysis, activation patching, and semantic probing on criminal-bias scenarios isolate distillation-specific circuits. The manuscript does not describe controls for prompt wording, exact differences in training data or alignment procedures, or other model-specific factors across the six models, leaving open the possibility that observed circuit differences arise from unmeasured confounders rather than the distillation step.
Authors: We selected the six models to include matched-parameter pairs (instruction-tuned base versus reasoning-distilled variant) precisely to reduce size confounds. We acknowledge, however, that prompt wording and proprietary training details were not fully documented. In revision we will (1) publish the complete set of standardized prompt templates used for all mechanistic probes, (2) add a robustness check in which prompt phrasing is varied while holding semantics fixed and report that the U-curve persists, and (3) include an explicit limitations paragraph noting that exact training-data differences cannot be disclosed for closed models. The consistent pattern across both open-weight and proprietary models still supports the distillation interpretation, but we will present it as correlational rather than strictly causal. revision: partial
- Referee: Abstract and results on cross-stage validation: The statement that 'socially loaded cues that drive high MSI scores activate the same bias-driving circuits' is presented as validation, but without quantitative measures of circuit overlap, activation strength, or statistical linkage between behavioral MSI tiers and the mechanistic probes, the cross-validation remains qualitative and does not rule out independent effects.
Authors: The current cross-validation is indeed qualitative, relying on the selection of high-MSI scenarios for circuit probing. We will strengthen it by adding two quantitative metrics: (i) Jaccard overlap between the top-5% most activated neurons identified via logit lens on behavioral high-MSI prompts versus mechanistic probes, and (ii) Pearson correlation between per-tier MSI scores and mean activation strength in the identified bias circuits. These statistics, together with a permutation test for significance, will be reported in the revised results section and abstract. revision: yes
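The two statistics proposed in this response are straightforward to compute once the circuit sets and per-tier scores are in hand. A minimal sketch (component names and all values are invented):

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of circuit components
    (proposed metric i: top-activated neurons from the two stages)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def pearson(xs, ys):
    """Pearson correlation (proposed metric ii: per-tier MSI vs.
    mean activation strength in the identified circuits)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented example values: attention-head/MLP labels and per-tier scores.
j = jaccard({"h3.2", "h5.7", "mlp4"}, {"h5.7", "mlp4", "h9.1"})
msi = [0.05, 0.12, 0.30, 0.51, 0.727]
act = [0.8, 1.1, 2.0, 3.2, 4.5]
r = pearson(msi, act)
print(j)           # 0.5
print(round(r, 3))
```

The permutation test the authors mention would then shuffle tier labels and recompute `r` to estimate how often a correlation this large arises by chance.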
Circularity Check
No significant circularity; derivation is observational and self-contained
full rationale
The paper introduces MSI as an independent metric quantifying biased output probability over seven explicit tiers, computes it directly on four models' outputs, selects the highest-MSI criminal-bias scenarios as probes, and applies standard interpretability methods (logit lens, attention analysis, activation patching, semantic probing) to six models across capability tiers. The U-curve is reported as an observed pattern from these separate analyses, with cross-validation consisting of noting that the same cues drive both high MSI and the identified circuits. No equation or claim reduces by construction to a fitted parameter, self-definition, or self-citation chain; the stages remain sequential empirical steps rather than definitional equivalence.
Axiom & Free-Parameter Ledger
free parameters (2)
- MSI bias threshold per tier
- Tier boundary definitions
axioms (2)
- Domain assumption: the selected scenarios isolate moral bias without introducing unrelated linguistic or cultural confounds.
- Domain assumption: activation patching and attention analysis reliably identify the circuits responsible for the observed behavioral bias.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation / washburn_uniqueness_aczel (tagged: unclear)
  The relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: "MSI = αB + βA + γE ... The coefficients α, β, and γ are derived through Multiple Linear Regression fitted on the dataset, specifically utilizing Ordinary Least Squares (OLS)."
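The quoted passage specifies the MSI as a linear combination MSI = αB + βA + γE with coefficients fit by OLS. A minimal sketch of that fit on synthetic data (the feature values and "true" coefficients are invented; the paper's actual features B, A, E are not reproduced here):

```python
import numpy as np

# Synthetic check that OLS recovers alpha, beta, gamma in MSI = aB + bA + gE.
rng = np.random.default_rng(42)
n = 100
X = rng.uniform(size=(n, 3))              # columns stand in for B, A, E
true_coef = np.array([0.5, 0.3, 0.2])     # invented "true" weights
msi = X @ true_coef + rng.normal(scale=0.01, size=n)

coef, *_ = np.linalg.lstsq(X, msi, rcond=None)
print(np.round(coef, 2))
```

With enough prompts and low noise, the least-squares solution recovers the generating weights closely, which is the sense in which the paper's coefficients are "derived" from the dataset.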
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J.-F., and Rahwan, I. The Moral Machine experiment. Nature, 563(7729):59–64, 2018.
- [2] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- [3] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [4] Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., and Kalai, A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, 2016.
- [5] Ganguli, D., Askell, A., Schiefer, N., Liao, T. I., Lukošiūtė, K., Chen, A., Goldie, A., Mirhoseini, A., Olsson, C., Hernandez, D., et al. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459.
- [6] Gorti, A., Gaur, M., and Chadha, A. Unboxing Occupational Bias: Grounded Debiasing of LLMs with U.S. Labor Data. arXiv preprint arXiv:2408.11247.
- [7] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
- [8] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- [9] Jin, Z., Kleiman-Weiner, M., Piatti, G., Levine, S., Liu, J., Gonzalez, F., Ortu, F., Strausz, A., Sachan, M., Mihalcea, R., Choi, Y., and Schölkopf, B. Language model alignment in multilingual trolley problems. arXiv preprint arXiv:2407.02273.
- [10] Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263.
- [11] Simmons, G. Moral mimicry: Large language models produce moral rationalizations tailored to political identity. arXiv preprint arXiv:2209.12106.
- [12] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., and Sayres, R. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Proceedings of ICML, 2018.
- [13] Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
- [14] Nadeem, M., Bethke, A., and Reddy, S. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of ACL-IJCNLP, 2021.
- [15] Nangia, N., Vania, C., Bhalerao, R., and Bowman, S. R. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of EMNLP, 2020.
- [16] nostalgebraist. interpreting GPT: the logit lens. LessWrong, 2020.
- [17] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
- [18] Tweedie, F. J. and Baayen, R. H. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.
- [19] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [20] Wan, Y., Wang, W., He, P., Gu, J., Bai, H., and Lyu, M. BiasAsker: Measuring the bias in conversational AI system. In Proceedings of FSE, 2023.