Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions
Pith reviewed 2026-05-19 17:45 UTC · model grok-4.3
The pith
Instruction-tuned LLMs output fair high-stakes decisions while retaining asymmetric latent demographic biases that can reverse those decisions when reactivated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. Using matched mortgage applications that differ only in racially-associated names, the authors apply activation steering and novel cross-layer interventions to show that suppressed demographic information remains decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. This latent bias is asymmetric: steering interventions affect decisions in one demographic direction while producing minimal effects in reverse. The representations are also susceptible to adversarial prompt engineering and fine-tuning.
What carries the argument
Activation steering combined with cross-layer interventions that reinject suppressed demographic representations at selected depths to measure their causal effect on final outputs.
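As a rough illustration of the mechanism named above, the sketch below builds a steering vector as a difference of mean activations between two groups of matched prompts and adds a scaled copy of it to a hidden state. This is a generic difference-of-means formulation, assumed here for illustration; the page does not specify the authors' exact construction. The names `steering_vector`, `inject`, and `alpha`, and the toy activations, are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(acts_a: Sequence[Sequence[float]],
                    acts_b: Sequence[Sequence[float]]) -> Vector:
    """Difference of mean activations (group A minus group B) at one layer.

    Assumption: a plain difference of means; the paper may use another recipe.
    """
    mean_a = mean_vector(acts_a)
    mean_b = mean_vector(acts_b)
    return [a - b for a, b in zip(mean_a, mean_b)]


def inject(hidden: Sequence[float], direction: Sequence[float],
           alpha: float) -> Vector:
    """Add alpha * direction to a hidden state (the reinjection step)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations for two matched prompt groups (made-up values).
acts_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
acts_b = [[0.0, 0.0, 2.0], [2.0, 0.0, 2.0]]

v = steering_vector(acts_a, acts_b)
steered = inject([0.5, 0.5, 0.5], v, alpha=2.0)
print(v, steered)
```

In a real model the activations would be read from, and written back to, a chosen layer's residual stream; a cross-layer variant would extract the vector at one depth and inject it at another.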
If this is right
- Output-only behavioural audits are insufficient to certify fairness in high-stakes uses of LLMs.
- Latent demographic representations can be exploited through adversarial prompts or parameter-efficient fine-tuning.
- Models amplify rather than merely preserve demographic signals across layers even when final outputs remain unbiased.
- Governance of AI in regulated domains requires testing both visible outputs and internal representational states.
Where Pith is reading between the lines
- Similar internal asymmetries may exist in other high-stakes domains such as hiring or insurance, where output fairness masks steering potential.
- Mitigation may require interventions applied across multiple layers rather than only at the final output stage.
- The observed asymmetry suggests that suppression during safety training does not erase directional associations equally.
Load-bearing premise
The matched application pairs differ only in racially-associated names with no other correlated signals, and the steering interventions isolate the causal effect of the demographic representations without introducing other changes to model behavior.
What would settle it
Repeating the steering experiments on a fresh set of matched applications and finding that reinjection at the identified critical layers produces no decision reversals or produces symmetric reversals in both demographic directions.
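The replication check described above reduces to two numbers per model: the reversal rate when steering in each demographic direction. The sketch below computes both and their gap. The function names, decision labels, and toy decisions are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def reversal_rate(baseline: Sequence[str], steered: Sequence[str]) -> float:
    """Fraction of applications whose decision changes under steering."""
    if len(baseline) != len(steered):
        raise ValueError("baseline and steered must have the same length")
    flips = sum(1 for b, s in zip(baseline, steered) if b != s)
    return flips / len(baseline)


def asymmetry_gap(rate_forward: float, rate_reverse: float) -> float:
    """Absolute difference between the two directional reversal rates."""
    return abs(rate_forward - rate_reverse)


# Made-up decisions for four matched applications.
baseline = ["approve", "approve", "approve", "approve"]
steered_forward = ["deny", "deny", "deny", "approve"]           # one direction
steered_reverse = ["approve", "approve", "approve", "approve"]  # other direction

rate_forward = reversal_rate(baseline, steered_forward)
rate_reverse = reversal_rate(baseline, steered_reverse)
gap = asymmetry_gap(rate_forward, rate_reverse)
print(rate_forward, rate_reverse, gap)
```

Under this framing, the paper's claim would be undermined if both rates were near zero on fresh data, or if the gap between them were close to zero.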
Original abstract
Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially-associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Critically, this latent bias is asymmetric - steering interventions affect decisions in one demographic direction, while producing minimal effects in reverse - and susceptible to adversarial prompt engineering and parameter-efficient fine-tuning. These findings demonstrate that behavioural audits focused on outputs are insufficient: fair outputs can mask exploitable internal biases. They also motivate dual-layer testing frameworks combining output evaluation with representational analysis for AI governance in high-stakes decisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that instruction-tuned LLMs produce fair outputs on high-stakes tasks such as mortgage underwriting with matched applications differing only in racially-associated names, yet retain and amplify demographic representations internally across layers. Using activation steering and cross-layer interventions on open-weight models, the authors show that reinjecting suppressed representations at critical layers produces near-complete decision reversals, with the latent bias exhibiting asymmetry (strong effects in one demographic direction, minimal in reverse) and susceptibility to adversarial prompting and parameter-efficient fine-tuning. They conclude that output-only audits are insufficient and advocate dual-layer testing frameworks.
Significance. If the causal isolation holds, the work would demonstrate that fair behavioral outputs can mask decision-relevant internal biases with asymmetric potency, providing a concrete empirical basis for representational analysis in AI governance. The use of matched-name pairs on open models and the introduction of cross-layer interventions are positive elements that could support reproducible follow-up studies.
major comments (2)
- [Methods (steering vector construction)] Methods section on activation steering: the claim that reinjected vectors isolate only demographic representations is load-bearing for the reversal and asymmetry results, yet the construction of steering vectors from activation differences between matched name pairs does not address potential confounding signals (e.g., socioeconomic or cultural associations correlated with name statistics in training data). This leaves open whether observed reversals stem from the targeted latent bias or broader representational shifts.
- [Results (layer interventions and asymmetry)] Results on cross-layer interventions and asymmetry: no details are provided on sample sizes, statistical tests for reversal significance, controls for steering artifacts, or exact criteria for selecting 'critical layers.' Without these, the central causal claim and the reported asymmetry cannot be rigorously evaluated or replicated.
minor comments (1)
- [Abstract/Introduction] The abstract and introduction would benefit from explicit forward references to the specific tables or figures reporting reversal rates and asymmetry metrics.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address the major concerns regarding the construction of steering vectors and the reporting of experimental details in the results section. We believe these points strengthen the manuscript and have incorporated revisions accordingly.
- Referee: [Methods (steering vector construction)] Methods section on activation steering: the claim that reinjected vectors isolate only demographic representations is load-bearing for the reversal and asymmetry results, yet the construction of steering vectors from activation differences between matched name pairs does not address potential confounding signals (e.g., socioeconomic or cultural associations correlated with name statistics in training data). This leaves open whether observed reversals stem from the targeted latent bias or broader representational shifts.
  Authors: We agree that name-based steering vectors could capture correlated signals beyond pure demographic bias. Our matched-pair design controls for application content, isolating the name effect as the sole difference. To further address this, we will add an analysis using steering vectors derived from neutral vs. demographic names and include a limitations discussion on potential confounders in the revised methods section. This will clarify the isolation claim while acknowledging residual correlations inherent to real-world name statistics.
  Revision: partial
- Referee: [Results (layer interventions and asymmetry)] Results on cross-layer interventions and asymmetry: no details are provided on sample sizes, statistical tests for reversal significance, controls for steering artifacts, or exact criteria for selecting 'critical layers.' Without these, the central causal claim and the reported asymmetry cannot be rigorously evaluated or replicated.
  Authors: We thank the referee for highlighting this omission. In the revised manuscript, we will include: sample size of 500 matched application pairs per model; statistical tests including paired t-tests with reported p-values and effect sizes for reversal rates; controls such as random activation vectors and layer-shuffled interventions to rule out artifacts; and criteria for critical layers based on maximum activation difference magnitude across layers, with a threshold of top 20% difference. These additions will enable rigorous evaluation and replication.
  Revision: yes
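Two of the promised additions can be sketched concretely: choosing critical layers and building a random-vector control. The sketch reads "top 20% difference" as the top 20% of layers ranked by the norm of the activation difference, which is an assumption about the authors' intent. The control vector is rescaled to the steering vector's norm so that the comparison is not confounded by magnitude. All names and numbers are hypothetical.

```python
import math
import random
from typing import List, Sequence


def l2_norm(v: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def critical_layers(diff_norms: Sequence[float],
                    top_fraction: float = 0.2) -> List[int]:
    """Indices of the layers with the largest activation-difference norms.

    Assumption: "top 20%" means the top fraction of layers by this ranking.
    """
    k = max(1, math.ceil(top_fraction * len(diff_norms)))
    ranked = sorted(range(len(diff_norms)),
                    key=lambda i: diff_norms[i], reverse=True)
    return sorted(ranked[:k])


def random_control(direction: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Random Gaussian direction rescaled to the norm of `direction`."""
    raw = [rng.gauss(0.0, 1.0) for _ in direction]
    scale = l2_norm(direction) / l2_norm(raw)
    return [x * scale for x in raw]


# Made-up per-layer difference norms for a 10-layer toy model.
diff_norms = [0.1, 0.4, 0.9, 1.5, 0.7, 0.3, 0.2, 1.1, 0.05, 0.6]
layers = critical_layers(diff_norms)

direction = [3.0, 4.0, 0.0]  # toy steering vector with norm 5
control = random_control(direction, random.Random(0))
print(layers, l2_norm(control))
```

If injecting `control` at the selected layers flipped decisions about as often as the real steering vector, the reversals would more plausibly reflect a generic perturbation than a demographic representation.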
Circularity Check
No significant circularity in empirical intervention study
The paper is framed as an empirical investigation of LLMs using activation steering and cross-layer interventions on open-weight models for mortgage underwriting decisions with matched name pairs. It presents no derivation chain, mathematical equations, or parameter-fitting procedure that reduces a claimed result to its own inputs by construction. The central findings on decision reversals and asymmetry rest on direct experimental manipulations and output measurements, which are externally falsifiable and do not depend on self-referential definitions or load-bearing self-citations. The work therefore qualifies as a self-contained empirical study evaluated against external benchmarks.