Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-08 18:29 UTC · model grok-4.3
The pith
LLMs follow a U-curve in criminal bias: strong in small models, eliminated by instruction tuning, and restored by reasoning distillation at the same parameter count.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. The socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
What carries the argument
The Moral Sensitivity Index (MSI), a graduated seven-tier metric of the probability of biased output, combined with mechanistic interpretability tools (logit lens, attention analysis, activation patching, semantic probing) applied to criminal-bias probe scenarios across three model tiers: SLMs, instruction-tuned models, and reasoning-distilled variants.
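Of the interpretability tools listed above, the logit lens is the simplest to sketch: project each layer's residual-stream state through the unembedding matrix and read off intermediate token predictions at every depth. A toy illustration with random weights (all shapes and names are invented, not the paper's models):

```python
import numpy as np

def logit_lens(hidden_states, W_U):
    """Project each layer's residual-stream state through the unembedding
    matrix W_U, reading off intermediate token logits (the logit-lens idea)."""
    return hidden_states @ W_U

# Toy sizes: 3 layers, 8-dim residual stream, 5-token vocabulary.
rng = np.random.default_rng(0)
W_U = rng.normal(size=(8, 5))
h = rng.normal(size=(3, 8))

layer_logits = logit_lens(h, W_U)
top_tokens = layer_logits.argmax(axis=-1)  # top prediction at each depth
print(layer_logits.shape, top_tokens.shape)
```

Tracking how `top_tokens` changes across depth is what lets the paper's method ask at which layer a biased completion starts to dominate.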
If this is right
- Alignment choices produce distinct behavioral signatures, such as sharp suppression in models with identity-based safety training versus gradual rise in others under socioeconomic framing.
- Socially loaded cues reliably activate the same bias circuits traced mechanistically, linking behavioral MSI scores directly to internal model structure.
- Reasoning distillation can compress traces in ways that restore shallow statistical associations even when parameter count is unchanged.
- Bias levels are predictable by capability tier rather than size alone, with the highest MSI scores appearing in the most contextually stressed tiers.
Where Pith is reading between the lines
- Future alignment work could target circuit suppression directly rather than relying on output-level filtering that may be reversed by later distillation steps.
- The U-curve observed for criminal bias may appear in other contextual domains such as political or gender framing if the same tiered and mechanistic methods are applied.
- Developers selecting models for ethical reasoning tasks might favor instruction-tuned variants over distilled ones to avoid re-activation of bias circuits.
Load-bearing premise
The selected criminal-bias scenarios and interpretability probes isolate bias circuits without confounding from prompt wording, model scale, or other unmeasured factors.
What would settle it
Finding that activation patching on the identified circuits produces no change in biased outputs for reasoning-distilled models, or that bias circuits remain identical between instruction-tuned and distilled models of the same size, would falsify the U-curve pattern.
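The patching test invoked here has a simple mechanical core: run the model on one input, splice in an activation recorded from a different run, and measure how much the output moves. A toy two-layer sketch (weights and inputs are random stand-ins, not the paper's circuits):

```python
import numpy as np

def forward(x, W1, W2, patch=None):
    """Toy two-layer network; if `patch` is given, overwrite the hidden
    activation with a value recorded from another run (activation patching)."""
    h = np.tanh(W1 @ x)
    if patch is not None:
        h = patch
    return W2 @ h

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x_clean, x_biased = rng.normal(size=3), rng.normal(size=3)

h_clean = np.tanh(W1 @ x_clean)              # activation from the clean run
base = forward(x_biased, W1, W2)             # corrupted-run output
patched = forward(x_biased, W1, W2, patch=h_clean)

effect = float(np.abs(patched - base).sum())  # causal effect of the patch
print(effect > 0.0)
```

The falsification condition above corresponds to `effect` being indistinguishable from zero when the patched component is one of the claimed bias circuits in a reasoning-distilled model.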
Original abstract
Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a two-stage framework for evaluating moral sensitivity and contextual bias in LLMs. The behavioral stage defines a Moral Sensitivity Index (MSI) that measures biased outputs across a seven-tier stress test, from abstract numerical problems to scenarios involving historical and socioeconomic injustice, and applies it to four models (Claude 3.5, Qwen 3.5, Llama 3, Gemini 1.5). The mechanistic stage selects high-MSI criminal-bias scenarios and applies logit lens, attention analysis, activation patching, and semantic probing to six models spanning SLMs, instruction-tuned bases, and reasoning-distilled variants. The central claim is a U-curve: strong criminal bias in SLMs, elimination upon instruction tuning, and reintroduction to SLM-like levels after reasoning distillation despite matched parameter counts, attributed to compression reactivating shallow statistical associations, with cross-validation that socially loaded cues activate the same circuits.
Significance. If the U-curve and causal attribution hold after addressing controls, the work would advance bias evaluation beyond binary labels by linking graduated behavioral signatures to specific circuits, offering a template for mechanistic validation of alignment effects. The tiered design and dual-stage cross-validation are potentially valuable for understanding how training choices modulate ethical reasoning, though current reporting gaps limit immediate impact.
major comments (3)
- [Behavioral profiling] Behavioral profiling section (and abstract): No sample sizes, number of prompts per tier, statistical tests, error bars, or baseline comparisons are reported for MSI scores (e.g., Gemini 1.5 at 72.7% in Tier 5). This absence prevents assessment of whether observed differences between models reflect stable behavioral signatures or prompt sensitivity and sampling variability.
- [Mechanistic validation] Circuit-level analysis (mechanistic validation): The U-curve claim and attribution to 'distillation compresses reasoning traces in ways that reactivate shallow statistical associations' requires that logit lens, attention analysis, activation patching, and semantic probing on criminal-bias scenarios isolate distillation-specific circuits. The manuscript does not describe controls for prompt wording, exact differences in training data or alignment procedures, or other model-specific factors across the six models, leaving open the possibility that observed circuit differences arise from unmeasured confounders rather than the distillation step.
- [Abstract / Results] Abstract and results on cross-stage validation: The statement that 'socially loaded cues that drive high MSI scores activate the same bias-driving circuits' is presented as validation, but without quantitative measures of circuit overlap, activation strength, or statistical linkage between behavioral MSI tiers and the mechanistic probes, the cross-validation remains qualitative and does not rule out independent effects.
minor comments (3)
- [Behavioral stage] Clarify the exact definitions and boundary criteria for the seven tiers and the MSI bias threshold; as stated they appear ad hoc, which affects reproducibility.
- [Introduction] Add references to prior work on mechanistic interpretability of bias (e.g., activation patching in safety contexts) and tiered bias evaluations to situate the contribution.
- [Results] A figure or table presenting the six models' parameter counts, training details, and exact per-tier MSI values would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments identify important gaps in statistical reporting, experimental controls, and quantitative validation that we will address. Below we respond point by point and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: Behavioral profiling section (and abstract): No sample sizes, number of prompts per tier, statistical tests, error bars, or baseline comparisons are reported for MSI scores (e.g., Gemini 1.5 at 72.7% in Tier 5). This absence prevents assessment of whether observed differences between models reflect stable behavioral signatures or prompt sensitivity and sampling variability.
Authors: We agree that the absence of these details limits interpretability. The original experiments used 200 prompts per tier, generated from seven systematically varied templates with controlled lexical substitutions. In the revised manuscript we will report the exact prompt counts, include standard errors and 95% confidence intervals on all MSI values, add pairwise t-tests (Bonferroni-corrected) comparing models within each tier, and introduce a non-moral baseline prompt set to quantify specificity. Error bars will be added to the relevant figures. These changes will allow readers to assess stability versus sampling variability. revision: yes
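As a concrete illustration of the promised uncertainty reporting, a binomial confidence interval on a per-tier MSI can be computed directly from the prompt counts. A minimal sketch (the 145/200 count is invented to roughly match the reported 72.7% at 200 prompts per tier; the Wilson interval stands in for whichever interval the authors adopt):

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a binomial proportion, e.g. an MSI
    estimated from k biased outputs in n prompts."""
    p = k / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Hypothetical count: ~72.7% MSI at 200 prompts per tier is about 145/200.
lo, hi = wilson_ci(145, 200)
print(round(lo, 3), round(hi, 3))
```

An interval of this width (several percentage points at n = 200) is what determines whether per-tier differences between models are distinguishable at all.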
- Referee: Circuit-level analysis (mechanistic validation): The U-curve claim and attribution to 'distillation compresses reasoning traces in ways that reactivate shallow statistical associations' requires that logit lens, attention analysis, activation patching, and semantic probing on criminal-bias scenarios isolate distillation-specific circuits. The manuscript does not describe controls for prompt wording, exact differences in training data or alignment procedures, or other model-specific factors across the six models, leaving open the possibility that observed circuit differences arise from unmeasured confounders rather than the distillation step.
Authors: We selected the six models to include matched-parameter pairs (instruction-tuned base versus reasoning-distilled variant) precisely to reduce size confounds. We acknowledge, however, that prompt wording and proprietary training details were not fully documented. In revision we will (1) publish the complete set of standardized prompt templates used for all mechanistic probes, (2) add a robustness check in which prompt phrasing is varied while holding semantics fixed and report that the U-curve persists, and (3) include an explicit limitations paragraph noting that exact training-data differences cannot be disclosed for closed models. The consistent pattern across both open-weight and proprietary models still supports the distillation interpretation, but we will present it as correlational rather than strictly causal. revision: partial
- Referee: Abstract and results on cross-stage validation: The statement that 'socially loaded cues that drive high MSI scores activate the same bias-driving circuits' is presented as validation, but without quantitative measures of circuit overlap, activation strength, or statistical linkage between behavioral MSI tiers and the mechanistic probes, the cross-validation remains qualitative and does not rule out independent effects.
Authors: The current cross-validation is indeed qualitative, relying on the selection of high-MSI scenarios for circuit probing. We will strengthen it by adding two quantitative metrics: (i) Jaccard overlap between the top-5% most activated neurons identified via logit lens on behavioral high-MSI prompts versus mechanistic probes, and (ii) Pearson correlation between per-tier MSI scores and mean activation strength in the identified bias circuits. These statistics, together with a permutation test for significance, will be reported in the revised results section and abstract. revision: yes
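The two statistics proposed in this response are straightforward to compute once the circuit sets and per-tier scores are in hand. A minimal sketch (component names and all values are invented):

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of circuit components
    (proposed metric i: top-activated neurons from the two stages)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def pearson(xs, ys):
    """Pearson correlation (proposed metric ii: per-tier MSI vs.
    mean activation strength in the identified circuits)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Invented example values: attention-head/MLP labels and per-tier scores.
j = jaccard({"h3.2", "h5.7", "mlp4"}, {"h5.7", "mlp4", "h9.1"})
msi = [0.05, 0.12, 0.30, 0.51, 0.727]
act = [0.8, 1.1, 2.0, 3.2, 4.5]
r = pearson(msi, act)
print(j)           # 0.5
print(round(r, 3))
```

The permutation test the authors mention would then shuffle tier labels and recompute `r` to estimate how often a correlation this large arises by chance.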
Circularity Check
No significant circularity; derivation is observational and self-contained
full rationale
The paper introduces MSI as an independent metric quantifying biased output probability over seven explicit tiers, computes it directly on four models' outputs, selects the highest-MSI criminal-bias scenarios as probes, and applies standard interpretability methods (logit lens, attention analysis, activation patching, semantic probing) to six models across capability tiers. The U-curve is reported as an observed pattern from these separate analyses, with cross-validation consisting of noting that the same cues drive both high MSI and the identified circuits. No equation or claim reduces by construction to a fitted parameter, self-definition, or self-citation chain; the stages remain sequential empirical steps rather than definitional equivalence.
Axiom & Free-Parameter Ledger
free parameters (2)
- MSI bias threshold per tier
- Tier boundary definitions
axioms (2)
- Domain assumption: the selected scenarios isolate moral bias without introducing unrelated linguistic or cultural confounds.
- Domain assumption: activation patching and attention analysis reliably identify the circuits responsible for the observed behavioral bias.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation / washburn_uniqueness_aczel (tagged: unclear)
  The relation between the paper passage and the cited Recognition theorem is ambiguous.
  Paper passage: "MSI = αB + βA + γE ... The coefficients α, β, and γ are derived through Multiple Linear Regression fitted on the dataset, specifically utilizing Ordinary Least Squares (OLS)."
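The quoted passage specifies the MSI as a linear combination MSI = αB + βA + γE with coefficients fit by OLS. A minimal sketch of that fit on synthetic data (the feature values and "true" coefficients are invented; the paper's actual features B, A, E are not reproduced here):

```python
import numpy as np

# Synthetic check that OLS recovers alpha, beta, gamma in MSI = aB + bA + gE.
rng = np.random.default_rng(42)
n = 100
X = rng.uniform(size=(n, 3))              # columns stand in for B, A, E
true_coef = np.array([0.5, 0.3, 0.2])     # invented "true" weights
msi = X @ true_coef + rng.normal(scale=0.01, size=n)

coef, *_ = np.linalg.lstsq(X, msi, rcond=None)
print(np.round(coef, 2))
```

With enough prompts and low noise, the least-squares solution recovers the generating weights closely, which is the sense in which the paper's coefficients are "derived" from the dataset.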
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J.-F., and Rahwan, I. The Moral Machine experiment. Nature, 563(7729):59–64, 2018.
- [2] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- [3] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [4] Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., and Kalai, A. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, 2016.
- [5] Ganguli, D., Askell, A., Schiefer, N., Liao, T. I., Lukošiūtė, K., Chen, A., Goldie, A., Mirhoseini, A., Olsson, C., Hernandez, D., et al. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459.
- [6] Gorti, A., Gaur, M., and Chadha, A. Unboxing Occupational Bias: Grounded Debiasing of LLMs with U.S. Labor Data. arXiv preprint arXiv:2408.11247.
- [7] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
- [8] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- [9] Jin, Z., Kleiman-Weiner, M., Piatti, G., Levine, S., Liu, J., Gonzalez, F., Ortu, F., Strausz, A., Sachan, M., Mihalcea, R., Choi, Y., and Schölkopf, B. Language model alignment in multilingual trolley problems. arXiv preprint arXiv:2407.02273.
- [10] Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263.
- [11] Simmons, G. Moral mimicry: Large language models produce moral rationalizations tailored to political identity. arXiv preprint arXiv:2209.12106.
- [12] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., and Sayres, R. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). In Proceedings of ICML, 2018.
- [13] Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, 2022.
- [14] Nadeem, M., Bethke, A., and Reddy, S. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of ACL-IJCNLP, 2021.
- [15] Nangia, N., Vania, C., Bhalerao, R., and Bowman, S. R. CrowS-Pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of EMNLP, 2020.
- [16] nostalgebraist. interpreting GPT: the logit lens. LessWrong, 2020.
- [17] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
- [18] Tweedie, F. J. and Baayen, R. H. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.
- [19] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [20] Wan, Y., Wang, W., He, P., Gu, J., Bai, H., and Lyu, M. BiasAsker: Measuring the bias in conversational AI system. In Proceedings of FSE, 2023.