Fairness Audits of Institutional Risk Models in Deployed ML Pipelines

Angelina Zhai; Dipto Das; Kelly McConvey; Maya Ghai; Rosa Lee; Shion Guha

arxiv: 2604.19468 · v1 · submitted 2026-04-21 · 💻 cs.CY · cs.AI · cs.HC

Pith reviewed 2026-05-10 01:31 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.HC

keywords fairness audit · early warning system · institutional ML · risk prediction · post-processing bias · educational disparities · student success · disparity analysis

The pith

An audit of a deployed college early warning system reveals that younger, male, and international students are over-flagged for support, while older and female students with similar risk are under-identified.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates a real early warning system at a college using its training data and design specifications. It then measures disparities by gender, age, and residency status across the training data, the model's predictions, and the post-processing step that turns probabilities into risk tiers. The audit finds systematic misallocation: younger, male, and international students are flagged more often even though many of them succeed, while older and female students with comparable dropout probabilities are missed. This matters because the flags decide who receives extra support, so the system can steer resources away from students who need them. The work shows how disparities build at each stage and argues that audits must check both statistical fairness and whether the risk score actually tracks real outcomes.

Core claim

The central claim is that the replica of the Early Warning System reveals systematic misallocation of support: younger, male, and international students are disproportionately flagged for intervention even when many ultimately succeed academically, whereas older and female students with similar dropout probabilities are under-identified. The post-processing step, which collapses probabilities into percentile-based risk tiers, amplifies these disparities. The audit evaluates the full pipeline from training data through model predictions to post-processing using standard fairness metrics, and concludes that disparities emerge and compound across stages. It also calls for attention to construct validity.
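A minimal sketch of what percentile-based tiering can look like, assuming the score is a predicted success probability (as in Figure 1) and that lower values mean higher risk. The percentile values (20 and 60), the toy scores, and the function names are illustrative assumptions, not the deployed system's configuration.

```python
from bisect import bisect_left

# Hypothetical tier labels, ordered from lowest to highest success probability.
TIER_LABELS = ("High Risk", "Medium Risk", "Low Risk")


def percentile_cutoffs(scores, percentiles):
    """Return the score at each percentile using the nearest-rank method."""
    ordered = sorted(scores)
    n = len(ordered)
    cutoffs = []
    for p in percentiles:
        rank = max(1, -(-p * n // 100))  # ceil(p * n / 100), at least 1
        cutoffs.append(ordered[rank - 1])
    return cutoffs


def assign_tier(score, cutoffs):
    """Map a success probability to a tier; scores at or below a cutoff fall in that tier."""
    return TIER_LABELS[bisect_left(cutoffs, score)]


# Toy cohort of predicted success probabilities (made up for illustration).
scores = [0.05, 0.15, 0.35, 0.38, 0.41, 0.62, 0.79, 0.81, 0.90, 0.97]

# Assumed cutoffs: bottom 20% -> High Risk, next 40% -> Medium Risk.
cutoffs = percentile_cutoffs(scores, (20, 60))
tiers = [assign_tier(s, cutoffs) for s in scores]

for s, t in zip(scores, tiers):
    print(f"{s:.2f} -> {t}")
```

In this toy cohort, students with success probabilities of 0.35 and 0.62 receive the same Medium Risk label, which illustrates how a tier can hide large differences in the underlying probability.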

What carries the argument

The replica model of the deployed Early Warning System, built from institutional training data and design specifications, which enables evaluation of standard fairness metrics at the data, prediction, and post-processing stages.

If this is right

Disparities in flagging appear in the training data, grow in model predictions, and are amplified by post-processing.
Support resources can be allocated based on group membership rather than actual likelihood of dropout or success.
Auditing only the model stage misses how post-processing steps compound bias.
A replicable method exists for checking full pipelines in other institutional ML systems.
Fairness evaluations should include whether the risk construct predicts real student outcomes, not just group balance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Repeating the replica audit at other colleges would test whether the same group patterns appear in different settings.
Using fixed probability thresholds instead of percentile tiers for flagging could limit the amplification effect.
The results point to possible changes in data collection or feature selection that might reduce group differences in future models.
Similar audits could be applied to risk models used in other domains where support is allocated by predicted need.

Load-bearing premise

The replica model accurately reproduces the behavior of the live deployed Early Warning System.

What would settle it

Re-running the fairness analysis on direct outputs from the actual deployed system and finding no significant disparities by gender, age, or residency status would challenge the claim.

Figures

Figures reproduced from arXiv: 2604.19468 by Angelina Zhai, Dipto Das, Kelly McConvey, Maya Ghai, Rosa Lee, Shion Guha.

**Figure 1.** Distribution of success prediction probabilities from the EWS. The orange dashed line (≈0.4) marks the High Risk threshold and the green dashed line (≈0.8) marks the Medium Risk threshold. [PITH_FULL_IMAGE:figures/full_fig_p007_1.png]

Original abstract

Fairness audits of institutional risk models are critical for understanding how deployed machine learning pipelines allocate resources. Drawing on multi-year collaboration with Centennial College, where our prior ethnographic work introduced the ASP-HEI Cycle, we present a replica-based audit of a deployed Early Warning System (EWS), replicating its model using institutional training data and design specifications. We evaluate disparities by gender, age, and residency status across the full pipeline (training data, model predictions, and post-processing) using standard fairness metrics. Our audit reveals systematic misallocation: younger, male, and international students are disproportionately flagged for support, even when many ultimately succeed, while older and female students with comparable dropout risk are under-identified. Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers. This work provides a replicable methodology for auditing institutional ML systems and shows how disparities emerge and compound across stages, highlighting the importance of evaluating construct validity alongside statistical fairness. It contributes one empirical thread to a broader program investigating algorithms, student data, and power in higher education.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper audits a real higher-ed early warning system and shows post-processing amplifies group disparities in flagging, but the replica model lacks any direct validation against the deployed system.

The main point is straightforward. This audit of Centennial College's Early Warning System finds that younger, male, and international students are flagged for support more often than their outcomes justify, while older and female students with similar risk are missed. The percentile-based risk tiers applied after the model widen those gaps. The authors built a replica from the institution's training data and specifications, then measured disparities by gender, age, and residency status across the training data, predictions, and post-processing, using standard fairness metrics (for example, demographic parity or equalized odds).
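To make the group comparison concrete, here is a hedged sketch of how per-group flag rates (the quantity behind demographic parity) and true/false positive rates (the quantities behind equalized odds) might be computed. The data, group labels "A" and "B", and function names are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict


def group_rates(y_true, y_flag, groups):
    """Per-group flag rate, true-positive rate, and false-positive rate.

    y_true: 1 if the student actually dropped out, else 0.
    y_flag: 1 if the system flagged the student for support, else 0.
    """
    counts = defaultdict(lambda: {"n": 0, "flag": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for t, f, g in zip(y_true, y_flag, groups):
        c = counts[g]
        c["n"] += 1
        c["flag"] += f
        if t == 1:
            c["pos"] += 1
            c["tp"] += f
        else:
            c["neg"] += 1
            c["fp"] += f
    rates = {}
    for g, c in counts.items():
        rates[g] = {
            "flag_rate": c["flag"] / c["n"],
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
    return rates


def gap(rates, key):
    """Largest between-group difference for one rate."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)


# Toy data: group A is flagged often despite few dropouts; group B is rarely flagged.
y_true = [1, 0, 0, 0, 1, 1, 0, 0]
y_flag = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_flag, groups)
print("demographic parity gap:", gap(rates, "flag_rate"))
print("TPR gap:", gap(rates, "tpr"), "FPR gap:", gap(rates, "fpr"))
```

A high false-positive rate in one group corresponds to the over-flagging pattern described above, and a low true-positive rate in another corresponds to under-identification.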

Referee Report

2 major / 2 minor

Summary. The paper presents a replica-based fairness audit of a deployed Early Warning System (EWS) for student dropout risk at Centennial College. Using institutional training data and design specifications, the authors replicate the model and evaluate disparities by gender, age, and residency status across the training data, model predictions, and post-processing stages with standard fairness metrics. They report systematic misallocation, with younger, male, and international students over-flagged despite many succeeding, while older and female students with comparable risk are under-identified; post-processing via percentile tiers is said to amplify these gaps. The work offers a replicable auditing methodology and stresses evaluating construct validity alongside statistical fairness.

Significance. If the replica accurately reproduces the deployed system, the audit provides a concrete empirical demonstration of how disparities can emerge and compound across stages of an institutional ML pipeline in higher education. The emphasis on full-pipeline evaluation and the replicable methodology are strengths that could inform similar audits elsewhere. The collaboration with the institution and grounding in prior ethnographic work (ASP-HEI Cycle) add practical relevance, though the single-institution scope limits generalizability.

major comments (2)

[§4.2] Replica Model Construction: The central claims of systematic misallocation depend on the replica faithfully reproducing the deployed EWS. The section describes construction from training data and specifications but reports no quantitative validation (e.g., no Pearson correlation, MAE, or side-by-side risk-score comparison on a held-out cohort against live EWS outputs). Any unstated differences in feature engineering or missing-value handling would directly affect the measured group disparities.
[§5] Results and Disparity Analysis: The findings of over-flagging for younger/male/international students and under-identification for older/female students are presented qualitatively without specific effect sizes, disparate-impact ratios, or statistical significance tests on the key subgroups. This makes it difficult to assess the magnitude and robustness of the misallocation claim.
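The fidelity check requested in the first major comment could be sketched as follows, assuming paired risk scores from the live system and the replica were available for the same students. The score values and function names are hypothetical; no such paired data is reported here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(xs, ys):
    """Average absolute difference between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical paired scores for the same four students.
live_scores = [0.20, 0.40, 0.60, 0.80]
replica_scores = [0.25, 0.35, 0.65, 0.75]

r = pearson(live_scores, replica_scores)
mae = mean_absolute_error(live_scores, replica_scores)
print(f"Pearson r = {r:.3f}, MAE = {mae:.3f}")
```

Because the audit concerns group disparities, running the same comparison separately within each demographic group would show whether the replica deviates more for some groups than others.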

minor comments (2)

[§3] The abstract and §3 mention use of 'standard fairness metrics' but do not list the exact metrics (e.g., demographic parity, equalized odds) or their formulas; adding an explicit enumeration would improve clarity.
[Figure 2] The pipeline diagram would benefit from labeling the exact percentile thresholds used in post-processing, as these are identified as free parameters that amplify disparities.
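For reference, the commonly used textbook forms of the metrics named in the first minor comment are given below. These are standard definitions and an assumption about what an explicit enumeration might contain; the paper's own formulas are not reproduced here.

```latex
% A: protected attribute (e.g., gender); a and b: any two groups.
% \hat{Y} = 1: student is flagged for support.
% Y = 1: student actually drops out.
\begin{align*}
\text{Demographic parity difference:}\quad
  & \Delta_{\mathrm{DP}} = \bigl| P(\hat{Y}=1 \mid A=a) - P(\hat{Y}=1 \mid A=b) \bigr| \\
\text{Disparate-impact ratio:}\quad
  & \mathrm{DI} = \frac{P(\hat{Y}=1 \mid A=a)}{P(\hat{Y}=1 \mid A=b)} \\
\text{Equalized odds gap } (y \in \{0,1\}):\quad
  & \Delta_{\mathrm{EO}}(y) = \bigl| P(\hat{Y}=1 \mid Y=y, A=a) - P(\hat{Y}=1 \mid Y=y, A=b) \bigr|
\end{align*}
```

Equalized odds holds when the gap is zero for both y = 1 (equal true-positive rates) and y = 0 (equal false-positive rates).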

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which help strengthen the rigor of our audit methodology. We address each major comment below and outline the revisions we will make.

Referee: [§4.2] Replica Model Construction: The central claims of systematic misallocation depend on the replica faithfully reproducing the deployed EWS. The section describes construction from training data and specifications but reports no quantitative validation (e.g., no Pearson correlation, MAE, or side-by-side risk-score comparison on a held-out cohort against live EWS outputs). Any unstated differences in feature engineering or missing-value handling would directly affect the measured group disparities.

Authors: We agree that explicit quantitative validation would increase confidence in the replica's fidelity. Due to institutional data access restrictions, we were provided only with the training dataset and design specifications rather than live EWS outputs on a held-out cohort, precluding direct side-by-side comparisons such as Pearson correlation or MAE. The replica was built by strictly following the documented feature engineering, missing-value handling, and model architecture described in §4.2. We will revise the manuscript to state these constraints explicitly, add a limitations subsection on replica fidelity, and report any available internal consistency checks (e.g., matching of subgroup risk-score distributions to institutional reports).

revision: partial

Referee: [§5] Results and Disparity Analysis: The findings of over-flagging for younger/male/international students and under-identification for older/female students are presented qualitatively without specific effect sizes, disparate-impact ratios, or statistical significance tests on the key subgroups. This makes it difficult to assess the magnitude and robustness of the misallocation claim.

Authors: We accept this critique and will strengthen the quantitative presentation. The revised §5 will include effect sizes (odds ratios and risk ratios for flagging by subgroup), disparate-impact ratios (positive-rate ratios relative to the reference group), and statistical significance tests (chi-squared tests on proportions with p-values and confidence intervals). These metrics will be added to the text, tables, and figures so readers can directly evaluate the magnitude and robustness of the reported disparities.

revision: yes
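The quantities promised in this response can be sketched for a single two-group comparison as below. The counts are invented, the function name is hypothetical, and the confidence interval uses a Wald approximation on the log risk ratio, which is one common choice rather than the authors' stated method.

```python
import math


def flagging_effect_sizes(flag_a, n_a, flag_b, n_b):
    """Effect sizes and a chi-squared test for flag rates in group a vs. reference group b."""
    a, b = flag_a, n_a - flag_a  # group a: flagged, not flagged
    c, d = flag_b, n_b - flag_b  # group b: flagged, not flagged
    rate_a, rate_b = a / n_a, c / n_b

    # With flagging as the positive outcome, the risk ratio is also the
    # positive-rate (disparate-impact) ratio relative to the reference group.
    risk_ratio = rate_a / rate_b
    odds_ratio = (a * d) / (b * c)

    # Pearson chi-squared statistic for a 2x2 table, no continuity correction.
    n = n_a + n_b
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))

    # Wald 95% confidence interval on the log risk ratio.
    se = math.sqrt(1 / a - 1 / n_a + 1 / c - 1 / n_b)
    ci = (risk_ratio * math.exp(-1.96 * se), risk_ratio * math.exp(1.96 * se))

    return {"risk_ratio": risk_ratio, "odds_ratio": odds_ratio,
            "chi2": chi2, "p_value": p_value, "risk_ratio_ci95": ci}


# Invented counts: 60 of 200 flagged in group a, 30 of 200 in reference group b.
stats = flagging_effect_sizes(60, 200, 30, 200)
for name, value in stats.items():
    print(name, value)
```

The sketch assumes no cell count is zero; small subgroups would call for an exact test or a continuity correction instead.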

Circularity Check

0 steps flagged

Empirical replica-based audit exhibits no circularity

The paper constructs a replica EWS from institutional training data and design specifications, then applies standard fairness metrics to its outputs across pipeline stages to measure disparities by gender, age, and residency. This is a direct empirical computation on external data, not a derivation or prediction that reduces to fitted parameters or self-referential definitions by construction. The single self-reference to prior ethnographic work (introducing the ASP-HEI Cycle) frames the collaboration but does not carry the load of the central claims, which rest on the computed metrics and observed misallocations. No equations, uniqueness theorems, ansatzes, or renamings are present that would create circular reduction. The analysis remains falsifiable against the institutional dataset and is self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the fidelity of the model replica to the deployed system and the appropriateness of applying standard fairness metrics to this educational context without additional validation of construct validity.

free parameters (1)

percentile thresholds for risk tiers
Post-processing collapses probabilities into percentile-based risk tiers; specific percentile cutoffs are chosen and can affect measured disparities.

axioms (1)

domain assumption: The replicated model, built from institutional training data and design specifications, faithfully represents the deployed EWS predictions and post-processing.
The entire audit is predicated on successful replication of the original model behavior.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] K. McConvey, M. Ghai, R. Lee, and S. Guha. “Risk, Retention, and the Algorithmic Institution: Artificial Intelligence as a Policy Response to Higher Education in Crisis”. In: Canadian Public Policy / Analyse de politiques (2026). doi:10.3138/cpp.2025-030

[2] K. McConvey and S. Guha. ““This Is Not a Data Problem”: Algorithms and Power in Public Higher Education in Canada”. In: Proceedings of the CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, 2024, pp. 1–14

[3] I. D. Raji, A. Smart, R. N. White, M. Mitchell, T. Gebru, B. Hutchinson, J. Smith-Loud, D. Theron, and P. Barnes. “Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing”. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, 2020, pp. 33–4... doi:10.1145/3351095.3372873

[4] D. Saxena, K. Badillo-Urquiola, P. J. Wisniewski, and S. Guha. “A Framework of High-Stakes Algorithmic Decision-Making for the Public Sector Developed through a Case Study of Child-Welfare”. In: Proceedings of the ACM on Human-Computer Interaction. Vol. 5. CSCW2. 2021, pp. 1–41. doi:10.1145/3476089

[5] E. S. Y. Moon and S. Guha. “A Human-Centered Review of Algorithms in Homelessness Research”. In: Proceedings of the CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, 2024, pp. 1–15. doi:10.1145/3613904.3642392

[6] K. Simbeck. “They Shall Be Fair, Transparent, and Robust: Auditing Learning Analytics Systems”. In: AI and Ethics 4.2 (2024), pp. 555–571

[7] Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan. “Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations”. In: Science 366.6464 (2019), pp. 447–453. doi:10.1126/science.aax2342

[8] S. Passi and S. Barocas. “Problem Formulation and Fairness”. In: Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, 2019, pp. 39–48. doi:10.1145/3287560.3287567

[9] A. Z. Jacobs and H. Wallach. “Measurement and Fairness”. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, 2021, pp. 375–385. doi:10.1145/3442188.3445901

[10] J. C. Perdomo, T. Britton, M. Hardt, and R. Abebe. “Difficult Lessons on Social Prediction from Wisconsin Public Schools”. In: arXiv preprint (2023). arXiv:2304.06205

[11] K. McConvey, S. Guha, and A. Kuzminykh. “A Human-Centered Review of Algorithms in Decision-Making in Higher Education”. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, 2023, pp. 1–15. doi:10.1145/3544548.3580658

[12] M. A. Kabir, M. U. Ahmed, S. Begum, S. Barua, and M. R. Islam. “Balancing Fairness: Unveiling the Potential of SMOTE-Driven Oversampling in AI Model Enhancement”. In: Proceedings of the 2024 9th International Conference on Machine Learning Technologies. ICMLT ’24. New York, NY, USA: Association for Computing Machinery, Sept. 2024, pp. 21–29. isbn: 979-8-4... doi:10.1145/3674029.3674034

[13] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. “Fairness through Awareness”. In: Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. Association for Computing Machinery, 2012, pp. 214–226. doi:10.1145/2090236.2090255.

Appendix A. Supplementary Tables and Figures. This appendix contains detailed fairness metrics and suppl...
