Moral mimicry: Large language models produce moral rationalizations tailored to political identity
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
verdicts: UNVERDICTED 2
roles: background 1
polarities: background 1
representative citing papers
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs, showing that proprietary models generally lead, that some open-source models come close, and that over-calibration can hurt utility.
citing papers explorer
- Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
  Introduces a tiered Moral Sensitivity Index for contextual bias and reports a U-curve pattern in which reasoning distillation reactivates bias circuits in LLMs.
- TrustLLM: Trustworthiness in Large Language Models
  TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs, showing that proprietary models generally lead, that some open-source models come close, and that over-calibration can hurt utility.