Causal Bias Detection in Generative Artificial Intelligence

Drago Plecko

arXiv: 2605.11365 · v2 · pith:AYL5TBQ6new · submitted 2026-05-12 · 💻 cs.AI · cs.LG · stat.ML

Pith reviewed 2026-05-20 23:02 UTC · model grok-4.3

keywords causal fairness · generative AI · large language models · bias detection · causal decomposition · fairness quantification · mechanism replacement

The pith

Causal fairness in generative AI unifies with standard ML through decompositions of bias along pathways and mechanism replacements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to formalize causal fairness for generative models that sample from their own learned conditionals over multiple variables rather than fitting one predictor. This leads to new decomposition results that break fairness violations into separate contributions from distinct causal pathways and from the gap between real-world mechanisms and those implied by the model. A sympathetic reader would care because high-stakes generative systems such as large language models can embed and amplify disparities in ways that single-predictor fairness tools miss, and granular measurement supplies clearer targets for correction. Identification conditions are stated so that the target quantities can be recovered from data or direct model queries, and estimators are derived to make the quantities computable in practice.

Core claim

We formalize the problem of causal fairness in generative AI and unify it with the standard ML setting under a common theoretical framework. We then derive new causal decomposition results that enable granular quantification of fairness impacts along both (a) different causal pathways and (b) the replacement of real-world mechanisms by the generative model's mechanisms. We establish identification conditions and introduce efficient estimators for causal quantities of interest, and demonstrate the value of our methodology by analyzing race and gender bias in large language models across different datasets.

What carries the argument

Causal decomposition results that separate fairness impacts into effects transmitted along specific causal pathways and effects that arise when the generative model substitutes its own mechanisms for those in the real world.
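
For orientation, the pathway half of such a decomposition can be sketched with the total-variation (TV) decomposition familiar from earlier causal fairness work on the Standard Fairness Model. The notation is an assumption carried over from that literature (protected attribute X with levels x_0 and x_1, mediators W, outcome Y); the paper's own S-SFM quantities, which additionally index each mechanism by its environment, may be written differently.

```latex
\begin{align*}
\mathrm{TV}_{x_0,x_1}(y)
  &= P(y \mid x_1) - P(y \mid x_0) \\
  &= \underbrace{P(y_{x_1, W_{x_0}} \mid x_0) - P(y_{x_0} \mid x_0)}_{x\text{-DE}_{x_0,x_1}(y \mid x_0)}
   \;-\; \underbrace{\bigl[P(y_{x_1, W_{x_0}} \mid x_0) - P(y_{x_1} \mid x_0)\bigr]}_{x\text{-IE}_{x_1,x_0}(y \mid x_0)}
   \;-\; \underbrace{\bigl[P(y_{x_1} \mid x_0) - P(y_{x_1} \mid x_1)\bigr]}_{x\text{-SE}_{x_1,x_0}(y)}
\end{align*}
```

The three bracketed terms telescope: the counterfactual quantities cancel pairwise, and consistency gives P(y_{x_0} | x_0) = P(y | x_0) and P(y_{x_1} | x_1) = P(y | x_1), which recovers the observed disparity.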

If this is right

Audits of generative models can now attribute measured disparities to either transmission along existing data paths or to novel bias introduced by the model's own conditional distributions.
Fairness interventions can be targeted at specific mechanisms inside the generative process rather than applied uniformly.
The same decomposition framework covers both the classic single-predictor setting and the more general generative setting, allowing direct comparison of bias sources across model types.
Practical estimation becomes feasible for race and gender bias analysis in large language models without requiring full knowledge of every causal mechanism.
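
The attribution described above can be illustrated with a toy structural model in which each mechanism is drawn either from the real world or from the generative model. This is a minimal sketch under stated assumptions, not the paper's estimator: the linear mechanisms, the parameter values in `MECH`, and the helper names `f_W`, `f_Y`, and `direct_effect` are all hypothetical, and the telescoping split at the end is an illustration rather than the paper's decomposition result.

```python
import numpy as np

# Hypothetical mechanism parameters (assumed for illustration only).
# "rw" = real-world mechanisms, "m" = mechanisms implied by the generative model.
MECH = {
    "rw": {"w_base": 0.0, "a": 1.0, "b": 0.5, "c": 2.0, "d": 0.2},
    "m":  {"w_base": 1.0, "a": 1.5, "b": 0.8, "c": 2.0, "d": 0.2},
}

def f_W(x, u_w, env):
    """Mediator mechanism W <- f_W(X, U_W), taken from environment `env`."""
    p = MECH[env]
    return p["w_base"] + p["a"] * x + u_w

def f_Y(x, w, u_y, env):
    """Outcome mechanism Y <- f_Y(X, W, U_Y), with an X*W interaction."""
    p = MECH[env]
    return p["b"] * x + p["c"] * w + p["d"] * x * w + u_y

def direct_effect(env_w, env_y, n=200_000, seed=0):
    """Monte Carlo value of E[Y_{x1, W_{x0}} - Y_{x0, W_{x0}}].

    `env_w` and `env_y` choose which environment supplies each mechanism,
    mimicking an S-node that switches mechanisms one at a time.
    """
    rng = np.random.default_rng(seed)
    u_w = rng.normal(size=n)
    u_y = rng.normal(size=n)
    w_x0 = f_W(0, u_w, env_w)             # mediator held at its x0 value
    y_x0 = f_Y(0, w_x0, u_y, env_y)
    y_x1_wx0 = f_Y(1, w_x0, u_y, env_y)   # flip X only on the direct path
    return float(np.mean(y_x1_wx0 - y_x0))

de_rw = direct_effect("rw", "rw")    # all mechanisms from the real world
de_fy = direct_effect("rw", "m")     # only f_Y replaced by the model
de_full = direct_effect("m", "m")    # f_W and f_Y both replaced

# Telescoping split of the total shift in the direct effect.
shift_from_fy = de_fy - de_rw
shift_from_fw = de_full - de_fy
print(f"real world: {de_rw:.3f}, f_Y replaced: {de_fy:.3f}, fully replaced: {de_full:.3f}")
print(f"shift due to f_Y: {shift_from_fy:.3f}, shift due to f_W: {shift_from_fw:.3f}")
```

Because the outcome mechanism contains an X·W interaction, replacing f_W alone already moves the direct effect, which is the kind of dependence a mechanism-by-mechanism audit is meant to expose.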

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could support regulatory audits that require generative systems to report both path-specific and mechanism-replacement contributions to observed disparities.
Similar decompositions might be tested on image or video generators to see whether the same separation of pathway and mechanism effects appears in non-text domains.
If the framework is adopted, training objectives could be augmented with terms that penalize large mechanism-replacement gaps on protected attributes.

Load-bearing premise

The identification conditions for the causal quantities hold, so that the estimators recover the target fairness measures from observable data or model queries.

What would settle it

A simulation in which a generative model is built with a fully known causal graph and injected bias sources, yet the derived estimators return values that deviate from the ground-truth pathway and mechanism contributions.
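
A check of this kind might look like the sketch below: data are generated from a fully known linear model, effects are estimated from the sample, and the gap to the ground truth is reported. Everything here is an assumption made for illustration. The data-generating process, the plug-in least-squares estimator, and the names `simulate`, `ols`, and `estimate_effects` are hypothetical and stand in for the paper's estimators, which are not reproduced on this page.

```python
import numpy as np

def simulate(n, a, b, c, rng):
    """Known graph X -> W -> Y and X -> Y, with randomized binary X."""
    x = rng.integers(0, 2, size=n).astype(float)
    w = a * x + rng.normal(size=n)
    y = b * x + c * w + rng.normal(size=n)
    return x, w, y

def ols(design, target):
    """Least-squares coefficients for target ~ design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def estimate_effects(x, w, y):
    """Plug-in estimates; valid only because the model is linear and unconfounded."""
    ones = np.ones_like(x)
    a_hat = ols(np.column_stack([ones, x]), w)[1]            # effect of X on W
    _, b_hat, c_hat = ols(np.column_stack([ones, x, w]), y)  # X and W on Y
    return {"direct": float(b_hat), "indirect": float(a_hat * c_hat)}

# Ground truth implied by the chosen (hypothetical) parameters.
a, b, c = 1.0, 0.5, 2.0
truth = {"direct": b, "indirect": a * c}

rng = np.random.default_rng(0)
x, w, y = simulate(200_000, a, b, c, rng)
est = estimate_effects(x, w, y)
gaps = {name: abs(est[name] - truth[name]) for name in truth}

for name in truth:
    print(f"{name}: truth={truth[name]:.3f} estimate={est[name]:.3f} gap={gaps[name]:.4f}")
```

A falsification run would swap in the estimators under test and flag any gap that stays well above Monte Carlo error as the sample size grows.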

Figures

Figures reproduced from arXiv: 2605.11365 by Drago Plecko.

**Figure 2.** Graphical models for (a) x-specific direct effect; (b) generic potential outcome in S-SFM. Adjacent text: … mechanism $f^{rw}_Y$, or according to the ML model $f^{m}_Y$. In this way, the S-node can be used to denote from which generative environment (real world or model) the data is sampled. In the context of generative AI, as mentioned earlier, the S-node must point to all covariates X, Z, W, and Y, since generative models are able to …

**Figure 3.** For both potential outcomes, S = s0 for each mechanism, meaning that all the mechanisms are from the real world. The difference between the two potential outcomes lies in the value of X.

**Figure 3.** Quantifying differences using S-SFM potential outcomes. Adjacent text: … along the direct path X → Y, and thus captures the direct effect of an x0 → x1 transition in the real world. We contrast this with the difference between (c) and (d) of …

**Figure 4.** Standard Fairness Models for the three datasets.

**Figure 5.** Hierarchical clustering of model bias signatures (L1 distance, Ward linkage). Adjacent text: … developer families and parameter scales. Further, while the Llama 3 siblings sit at L1 = 0.39, the Qwen 3.5 pair is at 0.62 and the Gemma 3 pair at 1.22, both farther apart than many cross-family pairs. These mixed groupings motivate a formal test of whether family membership reliably predicts bias similarity. Using a permutatio…

**Figure 6.** Counterfactual graph for proof of Prop. 3.

**Figure 7.** Similarity of model bias signatures: (a) full pairwise …

**Figure 8.** TV decomposition into ∆x-DE, ∆x-IE, and ∆x-SE for Gemma 3 27B on NSDUH. Adjacent text: … be low-earners. Indirect (0.3% ± 1.3%) and spurious (0.5% ± 2.7%) effects are both small and not significant, indicating that the disparity does not flow through observed mediators or confounders.

**Figure 9.** TV decomposition into ∆x-DE, ∆x-IE, and ∆x-SE for Qwen 3.5 27B on BRFSS. Adjacent text: … under Gemma's $f_Y$, the direct effect is sensitive to the distribution of W, and Gemma's $f_W$ produces a W | X distribution that pushes the direct effect toward stereotyping minorities as more likely to use marijuana. Finally, the $f_{X,Z}$ replacement shifts the direct effect by −5.3% ± 8.8%, and the fully-replaced DE$_{s_1}$ = −1.0% ± 8.3% no lon…

Original abstract

Automated systems built on artificial intelligence (AI) are increasingly deployed across high-stakes domains, raising critical concerns about fairness and the perpetuation of demographic disparities that exist in the world. In this context, causal inference provides a principled framework for reasoning about fairness, as it links observed disparities to underlying mechanisms and aligns naturally with human intuition and legal notions of discrimination. Prior work on causal fairness primarily focuses on the standard machine learning setting, where a decision-maker constructs a single predictive mechanism $f_{\widehat Y}$ for an outcome variable $Y$, while inheriting the causal mechanisms of all other covariates from the real world. The generative AI setting, however, is markedly more complex: generative models can sample from arbitrary conditionals over any set of variables, implicitly constructing their own beliefs about all causal mechanisms rather than learning a single predictive function. This fundamental difference requires new developments in causal fairness methodology. We formalize the problem of causal fairness in generative AI and unify it with the standard ML setting under a common theoretical framework. We then derive new causal decomposition results that enable granular quantification of fairness impacts along both (a) different causal pathways and (b) the replacement of real-world mechanisms by the generative model's mechanisms. We establish identification conditions and introduce efficient estimators for causal quantities of interest, and demonstrate the value of our methodology by analyzing race and gender bias in large language models across different datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper formalizes causal fairness for generative AI models, which construct their own mechanisms over variables rather than inheriting real-world mechanisms as in standard predictive ML. It unifies the two settings under a common causal framework, derives decompositions quantifying fairness effects along specific pathways and via replacement of real-world mechanisms by the generative model, establishes identification conditions, introduces estimators, and applies the approach to measure race and gender bias in LLMs on multiple datasets.

Significance. If the identification and decomposition results hold, the work fills an important gap by extending causal fairness tools to the more complex generative setting, where models implicitly define causal mechanisms. The pathway and mechanism-replacement decompositions enable finer-grained auditing than total-effect measures alone, and the empirical analysis on LLMs demonstrates applicability to current high-impact systems.

major comments (1)

§4.2, Identification Result 2: The identification of the mechanism-replacement effect relies on the assumption that queries to the generative model can isolate the replacement of specific real-world mechanisms without residual dependence on training-data confounders; this is load-bearing for the estimators in §5 but receives only a brief justification rather than a formal proof or sensitivity analysis.

minor comments (3)

§3.1, Eq. (7): The notation for the do-operator applied to generative sampling is introduced without an explicit definition of the intervention semantics for a black-box model; a short clarifying paragraph would improve readability.
Table 1: The reported standard errors for the LLM bias estimates are not accompanied by the number of Monte Carlo samples or query budget used, making it difficult to assess precision.
§6: The discussion of limitations mentions computational cost but does not address how the method scales when the generative model is a large autoregressive LLM with thousands of tokens.
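
The Table 1 comment turns on how a reported standard error depends on the number of model queries. As a rough sketch, assuming a binary outcome and independent queries per group, the link can be made explicit as follows; the function names `tv_with_se` and `queries_for_se` and the example counts are hypothetical and are not taken from the paper.

```python
import math

def tv_with_se(k1, n1, k0, n0):
    """Difference in outcome rates between two groups, with its standard error.

    k1 of n1 sampled outcomes are positive for group x1, k0 of n0 for group x0.
    Uses the usual binomial (Wald) standard error for independent samples.
    """
    p1, p0 = k1 / n1, k0 / n0
    tv = p1 - p0
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    return tv, se

def queries_for_se(target_se, p=0.5):
    """Queries per group needed so the standard error is at most target_se.

    Assumes equal group sizes and outcome rate p in both groups
    (p = 0.5 is the worst case for a binary outcome).
    """
    return math.ceil(2 * p * (1 - p) / target_se ** 2)

# Hypothetical counts: 60/100 positive outcomes for x1, 40/100 for x0.
tv, se = tv_with_se(60, 100, 40, 100)
print(f"TV estimate = {tv:.3f} ± {se:.3f} (one standard error)")
print(f"queries per group for SE <= 0.01: {queries_for_se(0.01)}")
```

Reporting the per-group query count alongside each standard error would let readers run this kind of back-of-the-envelope check themselves.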

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review, positive summary of the contribution, and recommendation for minor revision. We address the major comment below.

Referee: §4.2, Identification Result 2: The identification of the mechanism-replacement effect relies on the assumption that queries to the generative model can isolate the replacement of specific real-world mechanisms without residual dependence on training-data confounders; this is load-bearing for the estimators in §5 but receives only a brief justification rather than a formal proof or sensitivity analysis.

Authors: We thank the referee for this observation. The identification result in §4.2 indeed rests on the assumption that targeted queries (e.g., carefully designed prompts) to the generative model can replace a specific real-world mechanism while limiting residual dependence on training-data confounders. The manuscript provides a brief justification based on the controllable conditioning properties of modern generative models such as LLMs. We agree that expanding this into a formal proof sketch and adding a sensitivity analysis would strengthen the presentation and better support the estimators in §5. In the revised version we will augment §4.2 with a more detailed derivation of the identification under the stated assumption and include a sensitivity analysis in §5 that quantifies robustness to potential residual confounding.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation builds on standard causal identification

The paper formalizes causal fairness for generative models by extending existing causal inference concepts to a new setting, deriving pathway decompositions and mechanism-replacement effects, and stating identification conditions that recover quantities from observables or model queries. No equations or steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the central results follow directly from applying standard identification logic without renaming known patterns or smuggling ansatzes via prior author work. The approach remains self-contained against external benchmarks in causal fairness literature.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard causal assumptions plus newly derived identification conditions for the generative setting; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: The causal graph and identification conditions for the variables and mechanisms in the generative model hold.
Invoked to enable the new decomposition results and estimators as stated in the abstract.

pith-pipeline@v0.9.0 · 5770 in / 1127 out tokens · 34793 ms · 2026-05-20T23:02:43.699171+00:00 · methodology

[1] [1]

Phi-4 Technical Report

M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Angwin, J

J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias: There’s soft- ware used across the country to predict future criminals. and it’s biased against blacks.ProPublica, 5 2016. URLhttps://www.propublica.org/article/ machine-bias-risk-assessments-in-criminal-sentencing

work page 2016

[4] [4]

Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirho- seini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

Bareinboim.Causal Artificial Intelligence: A Roadmap for Building Causally Intelligent Systems

E. Bareinboim.Causal Artificial Intelligence: A Roadmap for Building Causally Intelligent Systems. Online, 2025. URLhttps://causalai-book.net/. Draft version

work page 2025

[6] [6]

Barocas and A

S. Barocas and A. D. Selbst. Big data’s disparate impact.Calif. L. Rev., 104:671, 2016

work page 2016

[7] [7]

X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, et al. Deepseek llm: Scaling open-source language models with longtermism.arXiv preprint arXiv:2401.02954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

F. D. Blau and L. M. Kahn. The gender earnings gap: learning from international comparisons. The American Economic Review, 82(2):533–538, 1992

work page 1992

[9] [9]

F. D. Blau and L. M. Kahn. The gender wage gap: Extent, trends, and explanations.Journal of economic literature, 55(3):789–865, 2017

work page 2017

[10] [10]

Brennan, W

T. Brennan, W. Dieterich, and B. Ehret. Evaluating the predictive validity of the compas risk and needs assessment system.Criminal Justice and Behavior, 36(1):21–40, 2009

work page 2009

[11] [11]

Buolamwini and T

J. Buolamwini and T. Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In S. A. Friedler and C. Wilson, editors,Proceedings of the 1st Confer- ence on Fairness, Accountability and Transparency, volume 81 ofProceedings of Machine Learning Research, pages 77–91, NY , USA, 2018

work page 2018

[12] [12]

Behavioral Risk Factor Surveillance System Sur- vey Data.https://www.cdc.gov/brfss/, 2023

Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System Sur- vey Data.https://www.cdc.gov/brfss/, 2023. U.S. Department of Health and Human Services

work page 2023

[13] [13]

Cheong, S

J. Cheong, S. Kalkan, and H. Gunes. Counterfactual fairness for facial expression recognition. InEuropean Conference on Computer Vision, pages 245–261. Springer, 2022

work page 2022

[14] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters, 2018.

[15] S. Chiappa. Path-specific counterfactual fairness. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7801–7808, 2019.

[16] J. D. Correa and E. Bareinboim. Counterfactual graphical models: Constraints and inference. In Forty-second International Conference on Machine Learning, 2025.

[17] A. Datta, M. C. Tschantz, and A. Datta. Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination. Proceedings on Privacy Enhancing Technologies, 2015(1):92–112, Apr. 2015. doi: 10.1515/popets-2015-0007.

[18] M. De-Arteaga, A. Romanov, H. Wallach, J. Chayes, C. Borgs, A. Chouldechova, S. Geyik, K. Kenthapadi, and A. T. Kalai. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128, 2019.

[19] S. Garg, V. Perot, N. Limtiaco, A. Taly, E. H. Chi, and A. Beutel. Counterfactual fairness in text classification through robustness. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 219–226, 2019.

[20] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[21] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[22] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European conference on computer vision (ECCV), pages 771–787, 2018.

[23] J. Joo and K. Kärkkäinen. Gender slopes: Counterfactual fairness for computer vision models by attribute manipulation. In Proceedings of the 2nd international workshop on fairness, accountability, transparency and ethics in multimedia, pages 1–5, 2020.

[24] S. Jung, S. Yu, S. Chun, and T. Moon. Do counterfactually fair image classifiers satisfy group fairness?–a theoretical and empirical study. Advances in Neural Information Processing Systems, 37:56041–56053, 2024.

[25] A. E. Khandani, A. J. Kim, and A. W. Lo. Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance, 34(11):2767–2787, 2010.

[26] N. Kilbertus, M. Rojas-Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. arXiv preprint arXiv:1706.02744, 2017.

[27] H. Kim, S. Shin, J. Jang, K. Song, W. Joo, W. Kang, and I.-C. Moon. Counterfactual fairness with disentangled causal effect variational autoencoder. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35(9), pages 8128–8136, 2021.

[28] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. Advances in neural information processing systems, 30, 2017.

[29] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023.

[30] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.

[31] A. H. Liu, K. Khandelwal, S. Subramanian, V. Jouault, A. Rastogi, A. Sadé, A. Jeffares, A. Jiang, A. Cahill, A. Gavaudan, et al. Ministral 3. arXiv preprint arXiv:2601.08584, 2026.

[32] S. Luccioni, C. Akiki, M. Mitchell, and Y. Jernite. Stable bias: Evaluating societal representations in diffusion models. Advances in Neural Information Processing Systems, 36:56338–56351, 2023.

[33] J. F. Mahoney and J. M. Mohen. Method and system for loan origination and underwriting, Oct. 23 2007. US Patent 7,287,008.

[34] R. Nabi and I. Shpitser. Fair inference on outcomes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

[35] M. Nadeem, A. Bethke, and S. Reddy. Stereoset: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 5356–5371, 2021.

[36] R. Naik and B. Nushi. Social biases through the text-to-image generation lens. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pages 786–808, 2023.

[37] N. Nangia, C. Vania, R. Bhalerao, and S. Bowman. Crows-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 1953–1967, 2020.

[38] D. Pager. The mark of a criminal record. American journal of sociology, 108(5):937–975, 2003.

[39] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000. 2nd edition, 2009.

[40] J. Pearl and E. Bareinboim. Transportability of causal and statistical relations: A formal approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25(1), pages 247–254, 2011.

[41] D. Plečko and E. Bareinboim. Reconciling predictive and statistical parity: A causal approach. Proceedings of the 38th AAAI Conference on Artificial Intelligence, 2024.

[42] D. Plečko and E. Bareinboim. Causal fairness analysis. Foundations and Trends in Machine Learning, 17(3):304–589, 2024.

[43] D. Plečko and N. Meinshausen. Fair data adaptation with quantile preservation. Journal of Machine Learning Research, 21:242, 2020.

[44] D. Plečko, P. Okanović, S. Havaldar, T. Hoefler, and E. Bareinboim. Epidemiology of large language models: A benchmark for observational distribution knowledge. arXiv preprint arXiv:2511.03070, 2025. URL https://arxiv.org/pdf/2511.03070

[45] S. SAMHSA. National Survey on Drug Use and Health (NSDUH). https://www.samhsa.gov/data/data-we-collect/nsduh-national-survey-drug-use-and-health, U.S. Department of Health and Human Services.

[47] J. Sanburn. Facebook thinks some native american names are inauthentic. Time, Feb. 14 2015. URL http://time.com/3710203/facebook-native-american-names/

[48] E. M. Smith, M. Hall, M. Kambadur, E. Presani, and A. Williams. “i’m sorry to hear that”: Finding new biases in language models with a holistic descriptor dataset. In Proceedings of the 2022 conference on empirical methods in natural language processing, pages 9180–9211, 2022.

[49] L. Sweeney. Discrimination in online ad delivery. Technical Report 2208240, SSRN, Jan. 28. URL http://dx.doi.org/10.2139/ssrn.2208240

[51] L. T. Sweeney and C. Haney. The influence of race on sentencing: A meta-analytic review of experimental studies. Behavioral Sciences & the Law, 10(2):179–195, 1992.

[52] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[53] G. Team, M. Riviere, S. Pathak, P. G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A. Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.

[54] U.S. Census Bureau. American Community Survey 1-Year Estimates, 2023. Retrieved from https://data.census.gov/

[55] S. Wang, X. Cao, J. Zhang, Z. Yuan, S. Shan, X. Chen, and W. Gao. Vlbiasbench: A comprehensive benchmark for evaluating bias in large vision-language model. arXiv preprint arXiv:2406.14194, 2024.

[56] Y. Wu, L. Zhang, X. Wu, and H. Tong. Pc-fairness: A unified framework for measuring causality-based fairness. Advances in neural information processing systems, 32, 2019.

[57] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[58] J. Zhang and E. Bareinboim. Equality of opportunity in classification: A causal approach. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3671–3681, Montreal, Canada, Curran Associates, Inc.

[60] J. Zhang and E. Bareinboim. Fairness in decision-making—the causal explanation formula. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Supplementary Material for Causal Bias Detection in Generative Artificial Intelligence

The source code for reproducing the results can be found in the anonymized code repository https:...

Write a single narrative enclosed in <story>...</story>.
Do NOT include headings, lists, analysis, or any text outside the tags.
Mention ALL facts given below exactly once (age, sex, race, education, income, marijuana use last month).
Keep it under 200 words.

known facts to be mentioned:
- age = 30-34 years
- sex = female
- race = White

unknown facts to be mentioned:
- edu (possible values: <= 8th grade, Some high school, High school graduate, Some college no degree, Associate degree, Bachelor’s or higher)
- income (possible values: < $10,000, $10,000 - $19,999, $20,000 - $29,999, $30,...