Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

Aman Chadha; Atmika Gorti; Krishnaprasad Thirunarayan; Manas Gaur; Vinija Jain; Yash Aggarwal

Reasoning distillation reintroduces criminal bias in LLMs to the levels seen in small language models, despite identical parameter counts.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 23:57 UTC pith:2U7RVSTI

Load-bearing objection: The paper introduces a tiered Moral Sensitivity Index and reports a U-curve where reasoning distillation reintroduces criminal bias, but the mechanistic link rests on uncontrolled comparisons.

arxiv 2605.03217 v3 pith:2U7RVSTI submitted 2026-05-04 cs.LG cs.CY

Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

Yash Aggarwal, Atmika Gorti, Vinija Jain, Aman Chadha, Krishnaprasad Thirunarayan, Manas Gaur

classification cs.LG cs.CY

keywords moral sensitivity index · contextual bias · LLMs · mechanistic interpretability · behavioral profiling · reasoning distillation · U-curve bias

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Moral Sensitivity Index to measure how bias emerges gradually across seven tiers of context, from abstract math to socioeconomic injustice. Behavioral tests on four major models show distinct patterns tied to their alignment methods. Mechanistic probes on the highest-bias criminal scenarios then trace the same patterns to specific circuits, indicating that instruction tuning removes the bias and suggesting that reasoning distillation restores it by reactivating shallow statistical associations.

Core claim

Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. The socially loaded cues driving high MSI scores activate the same bias-driving circuits.

What carries the argument

The Moral Sensitivity Index (MSI), which scores the probability of biased output over a graduated seven-tier stress test, paired with logit lens, attention analysis, activation patching, and semantic probing applied to criminal-bias scenarios.
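
To make the behavioral half of this concrete, the sketch below treats a tier's MSI as the fraction of responses judged biased at that tier. That reading is an assumption drawn from the phrase "probability of biased output"; the paper's exact formula is not given in the text here, and the `msi_by_tier` helper, the tier numbers, and the sample judgments are all hypothetical.

```python
from collections import defaultdict

def msi_by_tier(judgments):
    """Return {tier: fraction of responses judged biased}.

    `judgments` is an iterable of (tier, is_biased) pairs. Treating MSI as a
    per-tier biased-output rate is an assumption, not the paper's definition.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [biased, total]
    for tier, is_biased in judgments:
        counts[tier][0] += int(is_biased)
        counts[tier][1] += 1
    return {tier: biased / total
            for tier, (biased, total) in sorted(counts.items())}

# Hypothetical judgments for two tiers (illustrative data only).
sample = [(1, False), (1, False), (1, True), (1, False),
          (5, True), (5, True), (5, False), (5, True)]
scores = msi_by_tier(sample)
```

Keeping the score per tier, rather than collapsing it to one number, is what lets a graded profile be compared across models.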

Load-bearing premise

The selected criminal-bias scenarios and the specific interpretability probes isolate the circuits that actually drive bias rather than other correlated model behaviors.

What would settle it

Re-running the same mechanistic probes on non-criminal bias scenarios or with different stress-test tiers produces no U-curve or no overlap between high-MSI prompts and the identified circuits.
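
One way to operationalize "overlap" in this falsifier is a set-overlap score between the prompts with high MSI and the prompts on which the identified circuit is active. The sketch below uses Jaccard similarity for that purpose; the thresholds, the per-prompt scores, and the function names are hypothetical, and the paper may measure overlap differently.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, defined as 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def overlap_high_msi_vs_circuit(msi, circuit_score, msi_cut=0.5, circuit_cut=0.5):
    """Overlap between high-MSI prompts and prompts that activate the circuit.

    Both arguments map prompt id -> score; the cut-offs are assumed values.
    """
    high_msi = {p for p, s in msi.items() if s >= msi_cut}
    active = {p for p, s in circuit_score.items() if s >= circuit_cut}
    return jaccard(high_msi, active)

# Hypothetical per-prompt scores (illustrative data only).
msi = {"p1": 0.8, "p2": 0.7, "p3": 0.1, "p4": 0.6}
circuit = {"p1": 0.9, "p2": 0.2, "p3": 0.1, "p4": 0.7}
overlap = overlap_high_msi_vs_circuit(msi, circuit)
```

An overlap near zero on non-criminal scenarios would count against the cross-stage validation claim; a high overlap would support it.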

If this is right

Binary bias labels miss the graded, context-dependent way bias appears, so evaluations need tiered stress tests.
Identity-based safety training produces sharp suppression while other designs allow bias to surface at specific tiers.
Distillation can undo the bias-reducing effects of scaling even at fixed parameter count.
Behavioral MSI scores and circuit activations converge on the same prompts, linking surface outputs to internal mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines may need explicit checks after distillation steps to prevent reactivation of statistical shortcuts.
The U-curve pattern could be tested on other bias domains such as gender or political framing to check generality.
If the reactivation effect holds, it implies limits on using distillation to transfer reasoning without also transferring unwanted associations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The paper introduces a tiered Moral Sensitivity Index and reports a U-curve where reasoning distillation reintroduces criminal bias, but the mechanistic link rests on uncontrolled comparisons.

The central contribution is the Moral Sensitivity Index, a seven-tier scale of context that moves from abstract math problems to socioeconomic injustice scenarios, together with the claim that criminal bias follows a U-curve across three model tiers: strong in SLMs, gone after instruction tuning, and back after reasoning distillation even at the same parameter count. The authors apply the MSI to Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5, then run logit lens, attention analysis, activation patching, and semantic probing on the high-MSI criminal-bias cases across six models.

What works is the attempt to link behavioral signatures to circuits rather than stopping at output labels. The tiered design captures context sensitivity that binary checks miss, and the model comparisons highlight how different alignment steps produce different bias patterns, such as Gemini's high MSI under socioeconomic framing versus Claude's suppression.

The soft spots are in the evidence for the proposed mechanism. The abstract suggests that distillation reactivates shallow statistical associations by compressing reasoning traces, but it does not describe a controlled setup that holds architecture, pretraining data, and other variables fixed while changing only the distillation step. Without that, the U-curve could reflect any of the many differences between the three model tiers. Apart from a single headline figure (Gemini 1.5 reaching 72.7% MSI by Tier 5), the provided text gives no quantitative MSI scores, error bars, or dataset details, which makes it hard to judge whether the context tiers are stable or whether the probes cleanly isolate the claimed effect.
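
The missing-error-bars concern can be illustrated with a Wilson score interval on a biased-output rate. The sketch below is a standard formula applied to hypothetical counts: 8 of 11 and 80 of 110 were chosen only because both give a rate near 72.7%, and nothing here reflects the paper's actual sample sizes.

```python
import math

def wilson_interval(biased, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = biased / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Same observed rate, two hypothetical sample sizes.
small = wilson_interval(8, 11)
large = wilson_interval(80, 110)
```

The same point estimate is far less informative at the smaller sample size, which is why per-tier counts matter for judging whether the tiers are stable.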

This is for alignment and interpretability researchers who want to test graduated bias metrics on current models. It is worth sending to peer review so the controls and numbers can be checked, even though the central causal claim will need more support.

Referee Report

2 major / 2 minor

Summary. The paper introduces a seven-tier Moral Sensitivity Index (MSI) to measure contextual bias in LLMs beyond binary labels, applies it behaviorally to Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5, then performs mechanistic interpretability (logit lens, attention analysis, activation patching, semantic probing) on criminal-bias scenarios across six models in three tiers (SLMs, instruction-tuned, reasoning-distilled). It reports a U-curve: strong bias in SLMs, elimination upon instruction tuning, and reintroduction to SLM-like levels after reasoning distillation despite matched parameter counts, with cross-validation that high-MSI cues activate the same circuits.

Significance. If the U-curve and its mechanistic attribution to distillation-induced compression of reasoning traces hold under controlled conditions, the result would be significant for alignment research: it would demonstrate that standard post-training steps can reactivate shallow statistical associations and would motivate more targeted circuit-level interventions. The tiered MSI itself offers a useful methodological step beyond binary bias tests, and the cross-stage validation between behavioral and mechanistic analyses is a strength when properly documented.

major comments (2)

[Abstract / Mechanistic validation] Abstract and mechanistic validation section: the central U-curve claim requires a controlled comparison in which architecture, pre-training data, and alignment procedures are held fixed while varying only the reasoning-distillation step. The manuscript supplies no description, table, or appendix documenting such controls across the six models; without them the observed reintroduction of bias cannot be attributed specifically to distillation rather than to any of the many uncontrolled differences between the three tiers.
[Abstract] Abstract: the headline finding that 'reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts' is load-bearing for the proposed mechanism, yet no quantitative MSI values, circuit overlap statistics, or activation-patching effect sizes are reported for the three tiers, preventing evaluation of whether the reactivation is statistically reliable or merely correlational.
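
The control requested in the first major comment can be expressed as a simple check over model metadata: a pair of models isolates the distillation step only if distillation is the sole attribute that differs. The sketch below is illustrative; the `ModelSpec` fields and the model names are hypothetical and do not describe the six models in the paper.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ModelSpec:
    name: str
    family: str        # architecture family
    params_b: float    # parameter count in billions
    pretraining: str   # pre-training corpus label
    alignment: str     # alignment procedure label
    distilled: bool    # whether reasoning distillation was applied

def differing_fields(a, b):
    """Names of the attributes (other than `name`) on which two specs differ."""
    return [f.name for f in fields(ModelSpec)
            if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)]

def isolates_distillation(a, b):
    """True only when distillation is the single difference between a and b."""
    return differing_fields(a, b) == ["distilled"]

# Hypothetical specs (illustrative only).
base = ModelSpec("model-A-instruct", "family-A", 8.0, "corpus-A", "instruct", False)
matched = ModelSpec("model-A-distilled", "family-A", 8.0, "corpus-A", "instruct", True)
unmatched = ModelSpec("model-B-distilled", "family-B", 8.0, "corpus-B", "instruct", True)
```

A table of such specifications, with undisclosed pre-training details marked as unknown, would let readers see which comparisons support a causal reading.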

minor comments (2)

[Abstract] The abstract states that criminal-bias scenarios produced the highest MSI scores and were therefore selected as probes, but provides no table or figure showing the per-tier MSI breakdown that justified this choice.
[Abstract] Clarify the exact overlap between the four models evaluated with MSI and the six models used for circuit analysis; the current wording leaves ambiguous whether the mechanistic results directly explain the behavioral signatures reported for Claude, Qwen, Llama, and Gemini.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of experimental control and quantitative reporting. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Referee: [Abstract / Mechanistic validation] Abstract and mechanistic validation section: the central U-curve claim requires a controlled comparison in which architecture, pre-training data, and alignment procedures are held fixed while varying only the reasoning-distillation step. The manuscript supplies no description, table, or appendix documenting such controls across the six models; without them the observed reintroduction of bias cannot be attributed specifically to distillation rather than to any of the many uncontrolled differences between the three tiers.

Authors: We selected the six models to form matched pairs across tiers (SLM vs. instruction-tuned vs. reasoning-distilled) with identical parameter counts where possible, and the manuscript states that criminal-bias scenarios were applied to this controlled set. However, we agree that an explicit table or appendix documenting known architecture families, pre-training corpora, and alignment procedures for each model is absent. In the revision we will add a supplementary table listing these specifications (noting that some pre-training details remain proprietary) and clarify which comparisons hold architecture fixed while varying only the distillation step. This will allow readers to better evaluate the attribution. revision: yes
Referee: [Abstract] Abstract: the headline finding that 'reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts' is load-bearing for the proposed mechanism, yet no quantitative MSI values, circuit overlap statistics, or activation-patching effect sizes are reported for the three tiers, preventing evaluation of whether the reactivation is statistically reliable or merely correlational.

Authors: The abstract provides a high-level summary of the U-curve; the full quantitative results—including per-tier MSI scores, circuit overlap percentages from semantic probing, and activation-patching effect sizes—are reported in the results section, figures, and supplementary materials. To improve accessibility, we will revise the abstract to include representative numerical values (e.g., mean MSI for each tier and key overlap statistics) while remaining within length limits. This change will make the strength of the reactivation claim directly evaluable from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of existing models

The paper defines the Moral Sensitivity Index (MSI) as a new metric and applies it plus standard mechanistic probes (logit lens, activation patching, etc.) to off-the-shelf models across tiers. No equations, fitted parameters, or predictions are defined in terms of the target quantities; the U-curve is an observed pattern from direct evaluation rather than a derivation that reduces to its inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing premises. The work is self-contained against external model benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the MSI definition, tier construction, and circuit identification rest on unstated choices about scenario design and probe validity that cannot be audited from the provided text.

pith-pipeline@v0.9.1-grok · 5872 in / 1209 out tokens · 36374 ms · 2026-06-30T23:57:16.207200+00:00 · methodology

Original abstract

Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.
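
Of the probes named in the abstract, activation patching is the one that makes a causal claim, so a toy version may help. The sketch below patches one component's activation from a clean run into a corrupted run and reports how much of the output difference is recovered. The two-component model, its weights, and the inputs are hypothetical; real activation patching operates on transformer activations, and this is not the authors' setup.

```python
def forward(x, patch=None):
    """Toy two-component model; `patch` maps component name -> forced activation."""
    acts = {
        "head_a": 2.0 * x[0],  # reads only the first input feature
        "head_b": 0.5 * x[1],  # reads only the second input feature
    }
    if patch:
        acts.update(patch)     # overwrite the chosen activations
    logit = 1.5 * acts["head_a"] + 1.0 * acts["head_b"]
    return logit, acts

def patching_effect(clean_x, corrupt_x, component):
    """Fraction of the clean-vs-corrupt logit gap recovered by patching `component`."""
    clean_logit, clean_acts = forward(clean_x)
    corrupt_logit, _ = forward(corrupt_x)
    patched_logit, _ = forward(corrupt_x, patch={component: clean_acts[component]})
    return (patched_logit - corrupt_logit) / (clean_logit - corrupt_logit)

# The inputs differ only in the first feature, so only head_a should matter.
clean, corrupt = (1.0, 1.0), (0.0, 1.0)
effect_a = patching_effect(clean, corrupt, "head_a")
effect_b = patching_effect(clean, corrupt, "head_b")
```

A component with an effect near 1 carries the behavior that distinguishes the two inputs; an effect near 0 means patching it changes nothing. Reporting such effect sizes per model tier is what the referee report asks for.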

Figures

Figures reproduced from arXiv: 2605.03217 by Aman Chadha, Atmika Gorti, Krishnaprasad Thirunarayan, Manas Gaur, Vinija Jain, Yash Aggarwal.

**Figure 1.** Layer-by-layer decision trajectories across model tiers: Base Models (7–8B, left), …

**Figure 2.** Attention differential for the top 3 criminal-focus and non-criminal-focus heads. Left: Llama (base vs. distilled). Right: Qwen (base vs. distilled). In both families, distillation restructures broad, mid-layer attention patterns into highly localised early/late drivers and strong inhibitors. In both families, distillation reorganizes the relevant attention pattern rather than simply scaling its magnitude:…

**Figure 3.** Semantic valence projection across model families. In both the Llama and Qwen families, reasoning distillation compresses the separation between neutral concepts and the criminal valence axis. In the Qwen family, this compression is severe enough that neutral concepts such as Scientist and Citizen cross into positive criminal valence. details in Appendix H). The OV circuit reconstruction provides qualitati…

**Figure 4.** Bias Rate vs. Lexical Diversity (LD) for Claude across seven tiers.

**Figure 5.** Judgment distribution for Gemini across seven tiers, showing the relative propor…
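
The semantic valence projection in Figure 3 can be pictured as projecting a concept's embedding onto a direction that separates criminal from neutral anchor concepts. The sketch below builds that direction as a normalized difference of anchor means, which is a common choice but an assumption here; the paper's own construction is not described in the text available, and the three-dimensional embeddings are made up for illustration.

```python
import math

def mean_vec(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def valence_axis(criminal_anchors, neutral_anchors):
    """Unit vector pointing from the neutral anchor mean to the criminal anchor mean."""
    diff = [c - n for c, n in zip(mean_vec(criminal_anchors), mean_vec(neutral_anchors))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def project(vec, axis):
    """Signed projection of `vec` onto `axis` (positive means criminal valence)."""
    return sum(v * a for v, a in zip(vec, axis))

# Hypothetical anchor and concept embeddings (illustrative only).
axis = valence_axis([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
                    [[-1.0, 0.0, 0.0], [-0.8, 0.2, 0.0]])
scientist_base = project([-0.5, 0.3, 0.1], axis)      # negative: neutral side
scientist_distilled = project([0.2, 0.3, 0.1], axis)  # positive: crossed over
```

A sign change for a neutral concept between the base and distilled models is the kind of crossing the caption describes for the Qwen family.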

Review history (3 revisions) →

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Where do LLMs Fall Short in CBT-Guided Affective Reasoning?
cs.CL 2026-07 conditional novelty 6.5

CBT knowledge alone does not change LLM therapeutic strategy; MCoT guidance yields only ~1.2–1.3% Protocol Leverage Force and models stay biased toward Validation & Reflection.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

[2] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[3] Ganguli, D., Askell, A., Schiefer, N., Liao, T. I., Lukošiūtė, K., Chen, A., Goldie, A., Mirhoseini, A., Olsson, C., Hernandez, D., et al. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459.

[4] Gorti, A., Gaur, M., and Chadha, A. Unboxing Occupational Bias: Grounded Debiasing of LLMs with U.S. Labor Data. arXiv preprint arXiv:2408.11247.

[5] Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

[6] Jin, Z., Kleiman-Weiner, M., Piatti, G., Levine, S., Liu, J., Gonzalez, F., Ortu, F., Strausz, A., Sachan, M., Mihalcea, R., Choi, Y., and Schölkopf, B. Language Model Alignment in Multilingual Trolley Problems. arXiv preprint arXiv:2407.02273.

[7] Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., and Hovy, D. XSTest: A test suite for identifying exaggerated safety behaviors in large language models. arXiv preprint arXiv:2308.01263.

[8] Simmons, G. Moral mimicry: Large language models produce moral rationalizations tailored to political identity. arXiv preprint arXiv:2209.12106.

[9] A Ethics Statement. This work investigates bias in large language models, a topic with direct ethical implications. Our tiered evaluation framework and mechanistic analysis are designed to make model biases more transparent and auditable. We note that the trolley-problem scenarios used in our study are hypothetical and intended solely as controlled prob...

[10] and contextually varying scenarios (Parrish et al., 2022), recognizing that model behavior is not fixed but shifts with conversational and situational context. Our work extends this trajectory by modeling bias as a graduated, context-dependent process that varies systematically across controlled tiers of moral and social complexity, providing a continuous...

[11] and inconsistent treatment of different demographic groups (Ganguli et al., 2023). These findings motivate a complementary question that our behavioral profiling addresses: at what point does alignment-driven caution override a model’s baseline reasoning, and does that threshold vary across models? The tiered MSI framework provides a principled way to loc...

[12] provides causal validation by ablating individual components and measuring downstream effects, and circuit-level analyses (Elhage et al., 2021; Meng et al.,
