Fairness Audits of Institutional Risk Models in Deployed ML Pipelines
Pith reviewed 2026-05-10 01:31 UTC · model grok-4.3
The pith
An audit of a deployed college early warning system reveals younger male and international students are over-flagged for support while older and female students with similar risks are under-identified.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the replica of the Early Warning System reveals systematic misallocation of support: younger, male, and international students are disproportionately flagged for intervention even when many ultimately succeed academically, whereas older and female students with similar dropout probabilities are under-identified. Post-processing amplifies these disparities by collapsing probabilities into percentile-based risk tiers. The audit evaluates the full pipeline, from training data through model predictions to post-processing, using standard fairness metrics. It concludes that disparities emerge and compound across stages, and it calls for attention to construct validity.
What carries the argument
The replica model of the deployed Early Warning System, built from institutional training data and design specifications, which enables evaluation of standard fairness metrics at the data, prediction, and post-processing stages.
If this is right
- Disparities in flagging appear in the training data, grow in model predictions, and are amplified by post-processing.
- Support resources can be allocated based on group membership rather than actual likelihood of dropout or success.
- Auditing only the model stage misses how post-processing steps compound bias.
- A replicable method exists for checking full pipelines in other institutional ML systems.
- Fairness evaluations should include whether the risk construct predicts real student outcomes, not just group balance.
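One way to make the stage-by-stage reading above concrete is to compute the same group gap at each stage of the pipeline. The sketch below is illustrative only: the function names, the group labels, and every toy value are assumptions invented to exercise the code, not the paper's data, metrics, or results.

```python
def group_mean(values, groups, group):
    """Mean of `values` restricted to members of `group`."""
    selected = [v for v, g in zip(values, groups) if g == group]
    return sum(selected) / len(selected)


def stage_gaps(groups, labels, probs, flags, group_a, group_b):
    """Difference (group_a minus group_b) in the mean of each stage's output."""
    def gap(values):
        return group_mean(values, groups, group_a) - group_mean(values, groups, group_b)

    return {
        "training_data": gap(labels),    # observed dropout labels
        "predictions": gap(probs),       # predicted dropout probabilities
        "post_processing": gap(flags),   # final flagged / not-flagged decision
    }


# Invented toy cohort (hypothetical values, not taken from the paper).
groups = ["younger"] * 4 + ["older"] * 4
labels = [1, 1, 0, 0, 1, 0, 0, 0]
probs = [0.7, 0.6, 0.5, 0.4, 0.5, 0.2, 0.2, 0.1]
flags = [1, 1, 1, 0, 1, 0, 0, 0]

gaps = stage_gaps(groups, labels, probs, flags, "younger", "older")
for stage, value in gaps.items():
    print(f"{stage}: {value:+.2f}")
```

A gap that widens from one stage to the next is the kind of compounding the audit describes. A gap on its own does not establish misallocation; that also requires comparing flags against actual outcomes.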
Where Pith is reading between the lines
- Repeating the replica audit at other colleges would test whether the same group patterns appear in different settings.
- Using fixed probability thresholds instead of percentile tiers for flagging could limit the amplification effect.
- The results point to possible changes in data collection or feature selection that might reduce group differences in future models.
- Similar audits could be applied to risk models used in other domains where support is allocated by predicted need.
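The contrast between percentile tiers and fixed thresholds can be sketched in a few lines. This is a minimal illustration, assuming hypothetical tier cutoffs at the 50th and 80th percentiles and a hypothetical fixed threshold of 0.5; the deployed system's actual cutoffs are not given here, and the scores are invented.

```python
def percentile_tiers(scores, cutoffs_pct=(50, 80)):
    """Assign tier 0, 1, or 2 by rank within the cohort.

    `cutoffs_pct` are hypothetical percentile boundaries. Ties are broken
    by input order because Python's sort is stable.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    tiers = [0] * n
    for rank, i in enumerate(order):
        # Integer arithmetic avoids floating-point edge cases at the cutoffs.
        tiers[i] = sum(100 * rank >= c * n for c in cutoffs_pct)
    return tiers


def threshold_flags(scores, threshold=0.5):
    """Flag a student only if the predicted probability reaches the threshold."""
    return [s >= threshold for s in scores]


# Invented cohort: nine low-to-moderate scores and one high score.
scores = [0.05, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.90]

tiers = percentile_tiers(scores)
flags = threshold_flags(scores)

for s, t, f in zip(scores, tiers, flags):
    print(f"p={s:.2f}  tier={t}  above_threshold={f}")
```

In this toy cohort the students with probabilities 0.30 and 0.90 land in the same top tier, because tiers depend on rank rather than on the size of the predicted risk. The fixed threshold separates them. Whether a fixed threshold would reduce group disparities in the real system remains an open question that the sketch does not answer.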
Load-bearing premise
The replica model accurately reproduces the behavior of the live deployed Early Warning System.
What would settle it
Re-running the fairness analysis on direct outputs from the actual deployed system and finding no significant disparities by gender, age, or residency status would challenge the claim.
Original abstract
Fairness audits of institutional risk models are critical for understanding how deployed machine learning pipelines allocate resources. Drawing on multi-year collaboration with Centennial College, where our prior ethnographic work introduced the ASP-HEI Cycle, we present a replica-based audit of a deployed Early Warning System (EWS), replicating its model using institutional training data and design specifications. We evaluate disparities by gender, age, and residency status across the full pipeline (training data, model predictions, and post-processing) using standard fairness metrics. Our audit reveals systematic misallocation: younger, male, and international students are disproportionately flagged for support, even when many ultimately succeed, while older and female students with comparable dropout risk are under-identified. Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers. This work provides a replicable methodology for auditing institutional ML systems and shows how disparities emerge and compound across stages, highlighting the importance of evaluating construct validity alongside statistical fairness. It contributes one empirical thread to a broader program investigating algorithms, student data, and power in higher education.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a replica-based fairness audit of a deployed Early Warning System (EWS) for student dropout risk at Centennial College. Using institutional training data and design specifications, the authors replicate the model and evaluate disparities by gender, age, and residency status across the training data, model predictions, and post-processing stages with standard fairness metrics. They report systematic misallocation, with younger, male, and international students over-flagged despite many succeeding, while older and female students with comparable risk are under-identified; post-processing via percentile tiers is said to amplify these gaps. The work offers a replicable auditing methodology and stresses evaluating construct validity alongside statistical fairness.
Significance. If the replica accurately reproduces the deployed system, the audit provides a concrete empirical demonstration of how disparities can emerge and compound across stages of an institutional ML pipeline in higher education. The emphasis on full-pipeline evaluation and the replicable methodology are strengths that could inform similar audits elsewhere. The collaboration with the institution and grounding in prior ethnographic work (ASP-HEI Cycle) add practical relevance, though the single-institution scope limits generalizability.
major comments (2)
- §4.2 (Replica Model Construction): The central claims of systematic misallocation depend on the replica faithfully reproducing the deployed EWS. The section describes construction from training data and specifications but reports no quantitative validation (e.g., no Pearson correlation, MAE, or side-by-side risk-score comparison on a held-out cohort against live EWS outputs). Any unstated differences in feature engineering or missing-value handling would directly affect the measured group disparities.
- §5 (Results and Disparity Analysis): The findings of over-flagging for younger/male/international students and under-identification for older/female students are presented qualitatively without specific effect sizes, disparate-impact ratios, or statistical significance tests on the key subgroups. This makes it difficult to assess the magnitude and robustness of the misallocation claim.
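The kind of fidelity check requested in the §4.2 comment could look roughly like the following. It is a sketch under a strong assumption: that paired replica and live risk scores exist for the same held-out students, which the audit may not have had access to. All function names and values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(x, y):
    """Average absolute difference between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def mae_by_group(replica, live, groups):
    """MAE computed separately per group, to see whether fidelity differs by group."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, v in enumerate(groups) if v == g]
        result[g] = mean_absolute_error([replica[i] for i in idx],
                                        [live[i] for i in idx])
    return result


# Invented paired scores for six held-out students (hypothetical).
replica = [0.10, 0.40, 0.70, 0.20, 0.55, 0.90]
live = [0.12, 0.35, 0.75, 0.20, 0.50, 0.80]
groups = ["domestic", "domestic", "domestic",
          "international", "international", "international"]

print("r =", round(pearson_r(replica, live), 3))
print("MAE =", round(mean_absolute_error(replica, live), 3))
print("MAE by group =", mae_by_group(replica, live, groups))
```

Reporting the error per group matters here because a replica that tracks the live system well overall, but poorly for one subgroup, could itself create an apparent disparity.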
minor comments (2)
- [§3] The abstract and §3 mention use of 'standard fairness metrics' but do not list the exact metrics (e.g., demographic parity, equalized odds) or their formulas; adding an explicit enumeration would improve clarity.
- Figure 2 (pipeline diagram) would benefit from labeling the exact percentile thresholds used in post-processing, as these are identified as free parameters that amplify disparities.
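For readers unfamiliar with the two metrics the referee names as examples, the sketch below computes a demographic parity difference and an equalized odds difference from binary flags. Which metrics the paper actually uses is not stated here, so treat the choice, the function names, and the toy data as assumptions.

```python
def _ratio(numerator, denominator):
    """Safe division; returns NaN when a group has no cases in the denominator."""
    return numerator / denominator if denominator else float("nan")


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, true-positive rate, and false-positive rate."""
    rates = {}
    for g in sorted(set(groups)):
        idx = [i for i, v in enumerate(groups) if v == g]
        flagged = sum(y_pred[i] for i in idx)
        positives = sum(y_true[i] for i in idx)
        true_pos = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
        false_pos = flagged - true_pos
        rates[g] = {
            "selection_rate": _ratio(flagged, len(idx)),
            "tpr": _ratio(true_pos, positives),
            "fpr": _ratio(false_pos, len(idx) - positives),
        }
    return rates


def demographic_parity_difference(rates):
    """Largest gap in selection (flag) rate between any two groups."""
    values = [r["selection_rate"] for r in rates.values()]
    return max(values) - min(values)


def equalized_odds_difference(rates):
    """Larger of the TPR gap and the FPR gap across groups."""
    tprs = [r["tpr"] for r in rates.values()]
    fprs = [r["fpr"] for r in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Invented toy data: y_true = dropped out, y_pred = flagged (hypothetical).
groups = ["A"] * 4 + ["B"] * 4
y_true = [1, 0, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

rates = group_rates(y_true, y_pred, groups)
dp_diff = demographic_parity_difference(rates)
eo_diff = equalized_odds_difference(rates)
print(rates)
print("demographic parity difference:", dp_diff)
print("equalized odds difference:", round(eo_diff, 3))
```

Demographic parity compares flag rates alone, while equalized odds conditions on the actual outcome, so the two can disagree about whether a given flagging rule is fair.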
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which help strengthen the rigor of our audit methodology. We address each major comment below and outline the revisions we will make.
- Referee: §4.2 (Replica Model Construction): The central claims of systematic misallocation depend on the replica faithfully reproducing the deployed EWS. The section describes construction from training data and specifications but reports no quantitative validation (e.g., no Pearson correlation, MAE, or side-by-side risk-score comparison on a held-out cohort against live EWS outputs). Any unstated differences in feature engineering or missing-value handling would directly affect the measured group disparities.
  Authors: We agree that explicit quantitative validation would increase confidence in the replica's fidelity. Due to institutional data access restrictions, we were provided only with the training dataset and design specifications rather than live EWS outputs on a held-out cohort, precluding direct side-by-side comparisons such as Pearson correlation or MAE. The replica was built by strictly following the documented feature engineering, missing-value handling, and model architecture described in §4.2. We will revise the manuscript to state these constraints explicitly, add a limitations subsection on replica fidelity, and report any available internal consistency checks (e.g., matching of subgroup risk-score distributions to institutional reports).
  Revision: partial
- Referee: §5 (Results and Disparity Analysis): The findings of over-flagging for younger/male/international students and under-identification for older/female students are presented qualitatively without specific effect sizes, disparate-impact ratios, or statistical significance tests on the key subgroups. This makes it difficult to assess the magnitude and robustness of the misallocation claim.
  Authors: We accept this critique and will strengthen the quantitative presentation. The revised §5 will include effect sizes (odds ratios and risk ratios for flagging by subgroup), disparate-impact ratios (positive-rate ratios relative to the reference group), and statistical significance tests (chi-squared tests on proportions with p-values and confidence intervals). These metrics will be added to the text, tables, and figures so readers can directly evaluate the magnitude and robustness of the reported disparities.
  Revision: yes
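The quantities promised in the response to the §5 comment can be computed from a 2x2 table of flagged versus not-flagged counts for one subgroup and a reference group. The sketch below is a hedged illustration: the function name and counts are invented, the flag-rate ratio is used here as both the risk ratio for flagging and the disparate-impact ratio, the chi-squared statistic omits any continuity correction, and confidence intervals are left out.

```python
import math


def two_by_two_stats(flagged_a, total_a, flagged_ref, total_ref):
    """Effect sizes and a chi-squared test for flagging in group A vs a reference group.

    Assumes every cell of the 2x2 table is non-zero.
    """
    a, b = flagged_a, total_a - flagged_a          # group A: flagged, not flagged
    c, d = flagged_ref, total_ref - flagged_ref    # reference: flagged, not flagged

    rate_ratio = (a / total_a) / (c / total_ref)   # flag-rate (disparate-impact) ratio
    odds_ratio = (a * d) / (b * c)

    # Pearson chi-squared statistic for a 2x2 table, 1 degree of freedom.
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))

    return {
        "rate_ratio": rate_ratio,
        "odds_ratio": odds_ratio,
        "chi2": chi2,
        "p_value": p_value,
    }


# Invented counts (hypothetical): 30 of 100 flagged vs 15 of 100 in the reference group.
stats = two_by_two_stats(flagged_a=30, total_a=100, flagged_ref=15, total_ref=100)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Reporting the ratio and the test together helps separate how large a disparity is from how likely it is to be sampling noise, which is the distinction the referee asks for.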
Circularity Check
Empirical replica-based audit exhibits no circularity
The paper constructs a replica EWS from institutional training data and design specifications, then applies standard fairness metrics to its outputs across pipeline stages to measure disparities by gender, age, and residency. This is a direct empirical computation on external data, not a derivation or prediction that reduces to fitted parameters or self-referential definitions by construction. The single self-reference to prior ethnographic work (introducing the ASP-HEI Cycle) frames the collaboration but does not carry the load of the central claims, which rest on the computed metrics and observed misallocations. No equations, uniqueness theorems, ansatzes, or renamings are present that would create circular reduction. The analysis remains falsifiable against the institutional dataset and is self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- percentile thresholds for risk tiers
axioms (1)
- domain assumption: The replica model, built from institutional training data and design specifications, faithfully represents the deployed EWS predictions and post-processing.