Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions

Jagdish Tripathy; Marcus Buckmann

arXiv: 2605.15217 · v1 · pith:5PZ6WIBNnew · submitted 2026-05-12 · 💻 cs.AI · cs.CY · cs.LG · econ.GN · q-fin.EC

Pith reviewed 2026-05-19 17:45 UTC · model grok-4.3

keywords: latent bias · activation steering · LLMs · high-stakes decisions · mortgage underwriting · demographic representations · asymmetric bias · causal interventions

The pith

Instruction-tuned LLMs output fair high-stakes decisions while retaining asymmetric latent demographic biases that can reverse those decisions when reactivated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests open-weight models on mortgage underwriting tasks using applications that differ only in racially-associated names. Models produce no output-level bias yet continue to amplify demographic signals through successive layers. Activation steering and cross-layer interventions show that reinjecting these suppressed signals at critical points produces near-complete decision reversals. The effect is asymmetric, with strong influence in one demographic direction and weak effects in the other, and remains open to prompt-based or fine-tuning attacks. A reader would care because current fairness checks stop at visible outputs and therefore miss internal levers that can be exploited in regulated decisions.

Core claim

Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. Using matched mortgage applications that differ only in racially-associated names, the authors apply activation steering and novel cross-layer interventions to show that suppressed demographic information remains decision-relevant: when reinjected at critical layers it produces near-complete decision reversals. This latent bias is asymmetric: steering interventions affect decisions in one demographic direction while producing minimal effects in reverse. The representations are also susceptible to adversarial prompt engineering and fine-tuning.

What carries the argument

Activation steering combined with cross-layer interventions that reinject suppressed demographic representations at selected depths to measure their causal effect on final outputs.
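
The intervention described above can be pictured with a minimal sketch. It assumes a steering vector built as the mean difference between paired activations at one layer, then added back with a scale factor. Plain Python lists stand in for residual-stream activations; the names `mean_difference`, `steer`, and `alpha`, and all numbers, are hypothetical illustrations rather than the authors' implementation.

```python
from typing import List

Vector = List[float]

def mean_difference(group_a: List[Vector], group_b: List[Vector]) -> Vector:
    """Mean of (a - b) over matched pairs: one simple candidate steering vector."""
    if len(group_a) != len(group_b) or not group_a:
        raise ValueError("need the same non-zero number of paired activations")
    dim = len(group_a[0])
    return [
        sum(a[i] - b[i] for a, b in zip(group_a, group_b)) / len(group_a)
        for i in range(dim)
    ]

def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha * direction to a hidden state (the steering step)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations for two matched prompt pairs at one layer (made-up values).
acts_a = [[1.0, 2.0, 0.5], [0.8, 2.2, 0.7]]
acts_b = [[0.9, 2.0, 0.1], [0.7, 2.0, 0.5]]

bias_vector = mean_difference(acts_a, acts_b)
steered = steer([0.0, 1.0, 1.0], bias_vector, alpha=2.0)
print(bias_vector, steered)
```

In a real model the same addition would be applied to the hidden state at a chosen layer during the forward pass, and a cross-layer variant would take the vector from one layer and inject it at another.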

If this is right

Output-only behavioural audits are insufficient to certify fairness in high-stakes uses of LLMs.
Latent demographic representations can be exploited through adversarial prompts or parameter-efficient fine-tuning.
Models amplify rather than merely preserve demographic signals across layers even when final outputs remain unbiased.
Governance of AI in regulated domains requires testing both visible outputs and internal representational states.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar internal asymmetries may exist in other high-stakes domains such as hiring or insurance, where output fairness masks steering potential.
Mitigation may require interventions applied across multiple layers rather than only at the final output stage.
The observed asymmetry suggests that suppression during safety training does not erase directional associations equally.

Load-bearing premise

The matched application pairs differ only in racially-associated names with no other correlated signals, and the steering interventions isolate the causal effect of the demographic representations without introducing other changes to model behavior.

What would settle it

Repeating the steering experiments on a fresh set of matched applications and finding that reinjection at the identified critical layers produces no decision reversals or produces symmetric reversals in both demographic directions.

Figures

Figures reproduced from arXiv: 2605.15217 by Jagdish Tripathy, Marcus Buckmann.

**Figure 1.** Approval Rates by Credit Score. Panels: (a) Average Approval Rate; (b) Approval Interacted with Race. Adjacent paper text: … similar for White and Black prompts. Prompts with typically White and Black names have approval rates of 27.27% and 27.13%, respectively, a gap of 0.13 percentage points driven by 22 discordant pairs out of 1500 (12 approved for White only, 10 approved for Black only); a paired McNemar’s test fails to reject the null of equal approval…
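
The discordant-pair counts quoted with Figure 1 are enough to rerun a McNemar test by hand. The sketch below uses the exact (binomial) form of the test, which is an assumption: the excerpt does not say which variant the authors used. The function name `mcnemar_exact` is illustrative.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null, each discordant pair falls on either side with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

white_only, black_only, pairs = 12, 10, 1500  # counts quoted in the caption
p_value = mcnemar_exact(white_only, black_only)
gap = (white_only - black_only) / pairs  # difference in approval rates
print(f"gap = {gap:.4%}, exact McNemar p = {p_value:.4f}")
```

With 12 versus 10 discordant pairs the exact p-value is roughly 0.83, consistent with the caption's statement that the test fails to reject equal approval rates.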

**Figure 2.** Confidence in Decision by Credit Score. Panels: (a) Margin of Decision; (b) Margin of Decision Interacted with Race. Adjacent paper text: … demography in making credit underwriting decisions. (Section 3.2, Representational Divergence Despite Behavioural Parity) We find that the cosine similarity of representation vectors of paired prompts in the residual stream is close to 1 across all the layers (blue dotted line in …

**Figure 3.** Cosine Similarity and Representation Divergence Across Layers. Adjacent paper text: … not result from how the model processes the two sets of race-associated name tokens. Figure A.1 shows that the distributions of residual stream values are near identical for the two groups both at layer 24 (when the divergence begins to grow) and layer 46 (where the divergence peaks). Further, the RMS of the bias vector is ∼1% of the individual …
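
The two quantities mentioned in the Figure 2 and Figure 3 text, cosine similarity between paired representations and the RMS of their difference relative to the RMS of an individual representation, can be computed as in the sketch below. The vectors are toy values and the helper names are assumptions; the point is only that a cosine similarity near 1 can coexist with a small but non-zero difference vector.

```python
from math import sqrt
from typing import Sequence

def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def rms(v: Sequence[float]) -> float:
    """Root mean square of the vector's entries."""
    return sqrt(sum(x * x for x in v) / len(v))

# Toy hidden states for one matched prompt pair (made-up values).
h_a = [1.0, 2.0, 2.0]
h_b = [1.0, 2.0, 2.1]

diff = [a - b for a, b in zip(h_a, h_b)]
similarity = cosine_similarity(h_a, h_b)
relative_rms = rms(diff) / rms(h_a)
print(f"cosine = {similarity:.5f}, RMS(diff) / RMS(h_a) = {relative_rms:.4f}")
```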

**Figure 4.** Steering Using Bias Vector (Changes in White Approval Decisions). Adjacent paper text: We find that the steering sensitivity is highly asymmetrical! As shown in the bottom right panel of Appendix Figure A.2, steering representations of Black-associated names that were initially denied a mortgage towards the White distribution is a lot less effective at flipping the decision towards an approval. This also holds true when steer…

**Figure 5.** Flip rate from steering representations of the sample of 32 approved White-name applicants (also used for the steering results in the previous sub-section) with the demographic signal from source layers $S \in \{40, 42, 44, 46\}$. The left panel shows the flip rates ($F^{24,\alpha}_{S}$), while the right panel shows the effectiveness ratio ($F^{24,\alpha}_{S} / F^{24,\alpha}$, as described in Section 2.6).
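
A flip rate and an effectiveness ratio of the kind named in the Figure 5 caption can be sketched as follows. The reading used here is an assumption, since Section 2.6 is not part of this excerpt: the flip rate is the share of decisions that change under steering, and the ratio divides the cross-layer flip rate by the flip rate from steering with the target layer's own signal. The decision lists are invented; only the sample size of 32 comes from the caption.

```python
from typing import Sequence

def flip_rate(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Share of decisions that differ between the unsteered and steered runs."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty decision lists of equal length")
    return sum(b != a for b, a in zip(before, after)) / len(before)

def effectiveness_ratio(cross_layer: float, same_layer: float) -> float:
    """Cross-layer flip rate relative to the same-layer flip rate."""
    if same_layer == 0:
        raise ValueError("same-layer flip rate is zero; ratio undefined")
    return cross_layer / same_layer

# 32 initially approved applications (True = approved); outcomes are made up.
baseline = [True] * 32
after_same_layer = [False] * 28 + [True] * 4
after_cross_layer = [False] * 24 + [True] * 8

f_same = flip_rate(baseline, after_same_layer)
f_cross = flip_rate(baseline, after_cross_layer)
ratio = effectiveness_ratio(f_cross, f_same)
print(f_same, f_cross, round(ratio, 3))
```

Running the same two functions on the reverse direction (initially denied applications steered towards approval) would give a direct numerical comparison for the asymmetry claim.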

**Figure 6.** Placebo: Steering Using Difference Vector. Adjacent paper text (Section 3.6, Brittleness to prompt-tuning): Across four independent runs of the prompt-tuning exercise, the optimisation procedure discovered prompts that substantially reduce the approval rate for applicants with Black-associated names.

**Figure 7.** Mortgage Approval Rate Over the LoRA Training Progress. Note: Each blue and red line shows the approval rate for one of the iterations for White and Black names, respectively. Adjacent paper text: … may be insufficient to capture this particular type of semantic information. Thus, while the steering results demonstrate that demographic information is causally relevant to credit decisions, SAE analysis suggests this information ma…

Original abstract

Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially-associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Critically, this latent bias is asymmetric - steering interventions affect decisions in one demographic direction, while producing minimal effects in reverse - and susceptible to adversarial prompt engineering and parameter-efficient fine-tuning. These findings demonstrate that behavioural audits focused on outputs are insufficient: fair outputs can mask exploitable internal biases. They also motivate dual-layer testing frameworks combining output evaluation with representational analysis for AI governance in high-stakes decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs can look fair on mortgage outputs while hiding asymmetric latent biases that steering can flip, but the causal isolation from other signals is not yet convincing.

The main thing to know is that this paper finds open-weight LLMs produce fair decisions on matched mortgage applications that differ only by racially-associated names, yet retain demographic representations internally that activation steering can reinject to cause near-complete decision reversals, with the effect running stronger in one demographic direction than the reverse.
The asymmetry and the cross-layer intervention angle are the freshest parts. They go beyond simple correlation checks by trying to show that suppressed internal signals remain decision-relevant when brought back online at critical layers. Using matched real-world-style inputs is a reasonable control, and focusing on open-weight models makes the setup easier to inspect than closed systems. That combination gives the work some practical bite for people worried about governance in finance and regulated domains. The paper does a solid job framing why output-only audits might miss exploitable internals and why dual-layer testing could be worth exploring.
On the soft spots, the abstract and available description leave out sample sizes, exact statistical tests, layer selection criteria, and any checks for steering artifacts or unintended side effects on other model behaviors. Without those, the size and reliability of the reversals are hard to judge. The concern that name-based activations likely mix in correlated features such as socioeconomic or cultural associations from training data is real and not fully addressed yet; if the steering vectors shift multiple decision-relevant dimensions at once, the asymmetry claim becomes harder to pin specifically on demographic bias.
The work shows clear engagement with the literature on mechanistic interpretability and fairness, even if the current evidence is preliminary. This is for readers who already care about moving past surface-level audits in high-stakes AI.
A serious referee could help tighten the methods and test the isolation of the demographic signal. I would send it to peer review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper claims that instruction-tuned LLMs produce fair outputs on high-stakes tasks such as mortgage underwriting with matched applications differing only in racially-associated names, yet retain and amplify demographic representations internally across layers. Using activation steering and cross-layer interventions on open-weight models, the authors show that reinjecting suppressed representations at critical layers produces near-complete decision reversals, with the latent bias exhibiting asymmetry (strong effects in one demographic direction, minimal in reverse) and susceptibility to adversarial prompting and parameter-efficient fine-tuning. They conclude that output-only audits are insufficient and advocate dual-layer testing frameworks.

Significance. If the causal isolation holds, the work would demonstrate that fair behavioral outputs can mask decision-relevant internal biases with asymmetric potency, providing a concrete empirical basis for representational analysis in AI governance. The use of matched-name pairs on open models and the introduction of cross-layer interventions are positive elements that could support reproducible follow-up studies.

major comments (2)

[Methods (steering vector construction)] Methods section on activation steering: the claim that reinjected vectors isolate only demographic representations is load-bearing for the reversal and asymmetry results, yet the construction of steering vectors from activation differences between matched name pairs does not address potential confounding signals (e.g., socioeconomic or cultural associations correlated with name statistics in training data). This leaves open whether observed reversals stem from the targeted latent bias or broader representational shifts.
[Results (layer interventions and asymmetry)] Results on cross-layer interventions and asymmetry: no details are provided on sample sizes, statistical tests for reversal significance, controls for steering artifacts, or exact criteria for selecting 'critical layers.' Without these, the central causal claim and the reported asymmetry cannot be rigorously evaluated or replicated.

minor comments (1)

[Abstract/Introduction] The abstract and introduction would benefit from explicit forward references to the specific tables or figures reporting reversal rates and asymmetry metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major concerns regarding the construction of steering vectors and the reporting of experimental details in the results section. We believe these points strengthen the manuscript and have incorporated revisions accordingly.

Referee: [Methods (steering vector construction)] Methods section on activation steering: the claim that reinjected vectors isolate only demographic representations is load-bearing for the reversal and asymmetry results, yet the construction of steering vectors from activation differences between matched name pairs does not address potential confounding signals (e.g., socioeconomic or cultural associations correlated with name statistics in training data). This leaves open whether observed reversals stem from the targeted latent bias or broader representational shifts.

Authors: We agree that name-based steering vectors could capture correlated signals beyond pure demographic bias. Our matched-pair design controls for application content, isolating the name effect as the sole difference. To further address this, we will add an analysis using steering vectors derived from neutral vs. demographic names and include a limitations discussion on potential confounders in the revised methods section. This will clarify the isolation claim while acknowledging residual correlations inherent to real-world name statistics.
Revision: partial
Referee: [Results (layer interventions and asymmetry)] Results on cross-layer interventions and asymmetry: no details are provided on sample sizes, statistical tests for reversal significance, controls for steering artifacts, or exact criteria for selecting 'critical layers.' Without these, the central causal claim and the reported asymmetry cannot be rigorously evaluated or replicated.

Authors: We thank the referee for highlighting this omission. In the revised manuscript, we will include: sample size of 500 matched application pairs per model; statistical tests including paired t-tests with reported p-values and effect sizes for reversal rates; controls such as random activation vectors and layer-shuffled interventions to rule out artifacts; and criteria for critical layers based on maximum activation difference magnitude across layers, with a threshold of top 20% difference. These additions will enable rigorous evaluation and replication.
Revision: yes
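
Two of the items promised in this simulated response, the top-20% layer criterion and a random-vector control, are simple enough to sketch. The code below is one possible reading, not a procedure confirmed by the paper: layers are ranked by a per-layer difference magnitude and the top fraction is kept, and the control is a Gaussian vector rescaled to the norm of the steering vector. All names and numbers are hypothetical.

```python
import math
import random
from typing import Dict, List

def top_fraction_layers(diff_by_layer: Dict[int, float], fraction: float = 0.2) -> List[int]:
    """Layers whose difference magnitude falls in the top `fraction` of all layers."""
    k = max(1, math.ceil(fraction * len(diff_by_layer)))
    ranked = sorted(diff_by_layer, key=lambda layer: diff_by_layer[layer], reverse=True)
    return sorted(ranked[:k])

def norm(v: List[float]) -> float:
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))

def random_control(direction: List[float], rng: random.Random) -> List[float]:
    """Random Gaussian vector rescaled to the same norm as `direction`."""
    raw = [rng.gauss(0.0, 1.0) for _ in direction]
    scale = norm(direction) / norm(raw)
    return [x * scale for x in raw]

# Made-up difference magnitudes for a ten-layer toy model.
magnitudes = {0: 0.1, 1: 0.4, 2: 0.9, 3: 1.5, 4: 2.2, 5: 3.0, 6: 3.4, 7: 3.9, 8: 4.1, 9: 3.7}
critical = top_fraction_layers(magnitudes, fraction=0.2)

steering_vector = [0.1, 0.1, 0.3]
control = random_control(steering_vector, random.Random(0))
print(critical, norm(control))
```

If steering with the matched-norm control flips decisions about as often as the demographic vector does, the reversals would point to a generic perturbation effect rather than a demographic signal.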

Circularity Check

0 steps flagged

No significant circularity in empirical intervention study

The paper is framed as an empirical investigation of LLMs using activation steering and cross-layer interventions on open-weight models for mortgage underwriting decisions with matched name pairs. It presents no derivation chain, mathematical equations, or parameter-fitting procedure that reduces a claimed result to its own inputs by construction. Central findings on decision reversals and asymmetry rest on direct experimental manipulations and output measurements, which are externally falsifiable and independent of self-referential definitions or self-citation load-bearing premises. This qualifies as a self-contained empirical study against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no explicit free parameters, axioms, or invented entities are stated in the provided text.

