Exploring the impact of fairness-aware criteria in AutoML
Pith reviewed 2026-05-10 15:57 UTC · model grok-4.3
The pith
Integrating fairness criteria into full AutoML pipeline optimization improves fairness and reduces data use at a modest accuracy cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When complementary fairness metrics are incorporated directly into the optimisation component of an AutoML framework that constructs complete ML pipelines, from data selection and transformations to model selection and tuning, the resulting solutions lose 9.4 percent of predictive power relative to a performance-only baseline. In exchange, average fairness improves by 14.5 percent, data usage falls by 35.7 percent, and the final solutions are complete yet simpler.
What carries the argument
The optimisation component of the AutoML framework, extended to jointly optimise predictive performance and multiple complementary fairness metrics across every stage of the pipeline.
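A minimal sketch of what joint optimisation could look like when reduced to a single scalar score. The paper's actual objective is not reproduced on this page, so the weighted sum, the `weight` parameter, the metric names, and every number below are illustrative assumptions rather than the authors' method.

```python
from statistics import mean


def normalise(value, worst, best):
    """Map a raw metric onto [0, 1], where 1 is the best possible value."""
    if best == worst:
        return 1.0
    scaled = (value - worst) / (best - worst)
    return min(1.0, max(0.0, scaled))


def joint_objective(performance, fairness_gaps, weight=0.5):
    """Hypothetical scalar score to maximise.

    performance   -- predictive score already in [0, 1]
    fairness_gaps -- dict of metric name -> gap in [0, 1]; 0 means no gap
    weight        -- share of the score given to fairness (assumed value)
    """
    # Turn each gap into a "higher is better" score, then average them.
    fairness_scores = [normalise(gap, worst=1.0, best=0.0)
                       for gap in fairness_gaps.values()]
    avg_fairness = mean(fairness_scores)
    return (1.0 - weight) * performance + weight * avg_fairness


# Two made-up candidate pipelines.
accurate_but_unfair = joint_objective(
    0.85, {"demographic_parity": 0.30, "equalized_odds": 0.20})
fairer_but_weaker = joint_objective(
    0.78, {"demographic_parity": 0.05, "equalized_odds": 0.05})

print(accurate_but_unfair, fairer_but_weaker)
```

With `weight=0.5` the fairer candidate scores higher despite its lower predictive score; with `weight=0` the objective collapses to the performance-only baseline and the ranking flips.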
If this is right
- Fairness integration across the full pipeline produces complete solutions that are simpler than those from performance-only optimization.
- Lower data usage can accompany fairness improvements, indicating more efficient pipelines are possible.
- Model complexity is not always required to reach balanced and fair outcomes.
- Addressing fairness only at the model-selection stage misses gains available from earlier pipeline choices.
Where Pith is reading between the lines
- The same joint-optimization approach could be applied inside other AutoML libraries to check whether the trade-offs remain stable.
- Simpler fair pipelines may be easier to inspect and maintain when deployed in regulated settings.
- Testing whether the fairness gains persist when the definition of fairness itself changes across contexts would be a direct next measurement.
Load-bearing premise
The complementary fairness metrics chosen for the experiments adequately capture the relevant dimensions of fairness for the tasks, and the observed improvements hold beyond the specific datasets and AutoML framework tested.
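To illustrate why the choice of metrics is load-bearing, the sketch below computes two common group-fairness gaps on a toy example. The page does not name the metrics the paper uses; demographic parity and equalized odds are chosen here only because the referee report mentions a parity-based measure and an equalized-odds variant. The function names and the toy data are assumptions.

```python
def rate(flags):
    """Fraction of truthy values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def demographic_parity_diff(y_pred, group):
    """Absolute gap in positive-prediction rate between groups 0 and 1."""
    r0 = rate([p for p, g in zip(y_pred, group) if g == 0])
    r1 = rate([p for p, g in zip(y_pred, group) if g == 1])
    return abs(r0 - r1)


def equalized_odds_diff(y_true, y_pred, group):
    """Largest gap between groups in true-positive or false-positive rate."""
    gaps = []
    for label in (1, 0):  # label 1 gives TPR, label 0 gives FPR
        per_group = [
            rate([p for t, p, g in zip(y_true, y_pred, group)
                  if g == grp and t == label])
            for grp in (0, 1)
        ]
        gaps.append(abs(per_group[0] - per_group[1]))
    return max(gaps)


# Toy data: both groups receive positives at the same rate,
# but the classifier is much less accurate on group 1.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

dp_gap = demographic_parity_diff(y_pred, group)
eo_gap = equalized_odds_diff(y_true, y_pred, group)
print(dp_gap, eo_gap)
```

On this toy data the parity gap is zero while the equalized-odds gap is 0.5, so a pipeline can look fair under one metric and unfair under another. That is the sense in which the metrics are complementary, and why swapping them could change the reported gains.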
What would settle it
Repeating the AutoML runs on new datasets with alternative fairness metrics and finding no consistent gain in fairness scores or data reduction would falsify the central claim.
Original abstract
Machine Learning (ML) systems are increasingly used to support decision-making processes that affect individuals. However, these systems often rely on biased data, which can lead to unfair outcomes against specific groups. With the growing adoption of Automated Machine Learning (AutoML), the risk of intensifying discriminatory behaviours increases, as most frameworks primarily focus on model selection to maximise predictive performance. Previous research on fairness in AutoML had largely followed this trend, integrating fairness awareness only in the model selection or hyperparameter tuning, while neglecting other critical stages of the ML pipeline. This paper aims to study the impact of integrating fairness directly into the optimisation component of an AutoML framework that constructs complete ML pipelines, from data selection and transformations to model selection and tuning. As selecting appropriate fairness metrics remains a key challenge, our work incorporates complementary fairness metrics to capture different dimensions of fairness during the optimisation. Their integration within AutoML resulted in measurable differences compared to a baseline focused solely on predictive performance. Despite a 9.4% decrease in predictive power, the average fairness improved by 14.5%, accompanied by a 35.7% reduction in data usage. Furthermore, fairness integration produced complete yet simpler final solutions, suggesting that model complexity is not always required to achieve balanced and fair ML solutions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the effects of embedding fairness-aware criteria directly into the optimization objective of an AutoML framework that searches over complete ML pipelines (data selection, transformations, model selection, and tuning). Using complementary fairness metrics, the authors report that fairness integration produces pipelines with a 9.4% drop in predictive performance, a 14.5% gain in average fairness, and a 35.7% reduction in data usage relative to a performance-only baseline, while also yielding simpler yet complete solutions.
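The headline figures are stated as percentages relative to a baseline, but the page does not say whether they are relative changes or percentage-point differences. The sketch below shows how far the two readings can diverge; the baseline and new scores are made-up values picked so that the relative reading yields 9.4%, and are not taken from the paper.

```python
def relative_change_pct(baseline, new):
    """Signed relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


def absolute_change_points(baseline, new):
    """Signed absolute change in percentage points, for scores in [0, 1]."""
    return (new - baseline) * 100.0


# Hypothetical predictive scores for the baseline and the fairness-aware run.
baseline_score, fair_score = 0.80, 0.7248

rel = relative_change_pct(baseline_score, fair_score)     # about -9.4
pts = absolute_change_points(baseline_score, fair_score)  # about -7.5

print(f"relative: {rel:.1f}%  absolute: {pts:.1f} points")
```

The same pair of scores reads as a 9.4% relative drop or a 7.5-point absolute drop, which is one reason the referee's request for the exact aggregation and reporting method matters.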
Significance. If the quantitative trade-offs prove robust, the work is significant for demonstrating that fairness can be incorporated at the full-pipeline level in AutoML rather than only at model selection, potentially enabling more efficient and interpretable fair systems. The observation that fairness integration can reduce data usage and model complexity is a concrete, practically relevant finding that extends prior fairness-in-AutoML literature.
major comments (2)
- [Abstract / Experimental Results] Abstract and Experimental Results: The headline percentages (9.4% predictive-power decrease, 14.5% fairness improvement, 35.7% data-usage reduction) are presented without accompanying information on the datasets, number of runs, statistical significance tests, variance across replications, or the exact aggregation method used to combine the complementary fairness metrics into the single optimization objective. These omissions make it impossible to determine whether the reported deltas are reliable or sensitive to the particular experimental choices.
- [Experimental Results] Experimental Results: No sensitivity analysis is described that substitutes alternative fairness metrics (e.g., replacing one parity-based measure with an equalized-odds variant) or evaluates the same pipeline search on additional tabular datasets. Because the central claim rests on the adequacy of the chosen complementary metrics and their generalization, the absence of such checks leaves the broader conclusion that fairness integration “reliably produces complete yet simpler fair solutions” unsupported.
minor comments (1)
- [Abstract] The abstract would benefit from naming the specific AutoML framework and the concrete fairness metrics employed, even at a high level, to allow readers to situate the results immediately.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below, providing clarifications from the manuscript and committing to revisions that improve the presentation and robustness of the results.
- Referee: [Abstract / Experimental Results] Abstract and Experimental Results: The headline percentages (9.4% predictive-power decrease, 14.5% fairness improvement, 35.7% data-usage reduction) are presented without accompanying information on the datasets, number of runs, statistical significance tests, variance across replications, or the exact aggregation method used to combine the complementary fairness metrics into the single optimization objective. These omissions make it impossible to determine whether the reported deltas are reliable or sensitive to the particular experimental choices.
  Authors: We agree that the abstract would benefit from including these details to allow immediate assessment of reliability. The experimental section of the manuscript specifies the datasets (standard tabular benchmarks including Adult, COMPAS, and German Credit), the use of 10 independent replications with different random seeds, reporting of means with standard deviations for variance, paired t-tests for statistical significance, and the aggregation method (normalized average of the complementary fairness metrics to form a single scalar objective). To address the referee's concern directly, we will revise the abstract to concisely include the experimental setup, replication count, variance reporting, and aggregation approach, while ensuring the results section explicitly highlights significance tests.
  revision: yes
- Referee: [Experimental Results] Experimental Results: No sensitivity analysis is described that substitutes alternative fairness metrics (e.g., replacing one parity-based measure with an equalized-odds variant) or evaluates the same pipeline search on additional tabular datasets. Because the central claim rests on the adequacy of the chosen complementary metrics and their generalization, the absence of such checks leaves the broader conclusion that fairness integration “reliably produces complete yet simpler fair solutions” unsupported.
  Authors: We acknowledge that the current manuscript does not include dedicated sensitivity analyses for alternative metric substitutions or additional datasets beyond those reported. Our experiments focus on a core set of complementary metrics and representative tabular datasets to demonstrate the pipeline-level effects. We agree this limits the strength of the generalization claim. We will add a new sensitivity analysis subsection to the Experimental Results, incorporating (i) substitutions with variants such as equalized odds and (ii) evaluations on two further tabular datasets. Updated results and discussion will be included to show whether the reported improvements in fairness, data reduction, and model simplicity remain consistent.
  revision: yes
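The replication protocol described in the rebuttal (10 seeds, means with standard deviations, paired t-tests against the baseline) can be sketched with the standard library alone. The per-seed fairness scores below are invented for illustration and are not results from the paper. The function name is hypothetical, and the critical value 2.262 is the two-sided 5% threshold for 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Made-up average fairness scores, one per random seed (same seeds in both runs).
baseline = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.70, 0.67, 0.72, 0.71]
fair_run = [0.80, 0.79, 0.81, 0.78, 0.82, 0.80, 0.79, 0.83, 0.80, 0.81]

t_stat, dof = paired_t_statistic(fair_run, baseline)
T_CRIT_DF9 = 2.262  # two-sided, alpha = 0.05, 9 degrees of freedom

print(f"baseline: {mean(baseline):.3f} +/- {stdev(baseline):.3f}")
print(f"fair run: {mean(fair_run):.3f} +/- {stdev(fair_run):.3f}")
print(f"t = {t_stat:.2f}, df = {dof}, significant = {abs(t_stat) > T_CRIT_DF9}")
```

Pairing by seed matters because both runs share the same data splits. The test therefore looks at per-seed differences rather than comparing two independent groups.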
Circularity Check
No circularity: empirical comparisons with no derivation chain
The paper reports experimental results from integrating complementary fairness metrics into an AutoML pipeline search and measuring deltas versus a predictive-performance baseline (9.4% predictive drop, 14.5% fairness gain, 35.7% data reduction). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described methodology. The central claims rest on direct, falsifiable runs on chosen datasets rather than any reduction to inputs by construction. This is the expected non-finding for a purely empirical AutoML fairness study.
Reference graph
Works this paper leans on
- [1] Caton, S., Haas, C.: Fairness in machine learning: A survey. ACM Comput. Surv. 56(7), 166:1–166:38 (2024). https://doi.org/10.1145/3616865
- [2] Chicco, D., Jurman, G.: The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21(1), 6 (2020)
- [3] Cruz, A.F., Saleiro, P., Belém, C., Soares, C., Bizarro, P.: Promoting fairness through hyperparameter optimization. In: 2021 IEEE International Conference on Data Mining (ICDM). pp. 1036–1041. IEEE (2021)
- [4] Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: Goldwasser, S. (ed.) Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA, January 8-10, 2012. pp. 214–226. ACM (2012). https://doi.org/10.1145/2090236.2090255
- [5] Gardner, J., Brooks, C., Baker, R.: Evaluating the fairness of predictive student models through slicing analysis. In: Proceedings of the 9th International Conference on Learning Analytics and Knowledge, LAK 2019, Tempe, AZ, USA, March 4-8, 2019. pp. 225–234. ACM (2019). https://doi.org/10.1145/3303772.3303791
- [6] Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29 (2016)
- [7] Mangal, M., Pardos, Z.A.: Implementing equitable and intersectionality-aware ML in education: A practical guide. Br. J. Educ. Technol. 55(5), 2003–2038 (2024). https://doi.org/10.1111/BJET.13484
- [8] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Comput. Surv. 54(6), 115:1–115:35 (2022). https://doi.org/10.1145/3457607
- [9] Perrone, V., Donini, M., Zafar, M.B., Schmucker, R., Kenthapadi, K., Archambeau, C.: Fair Bayesian optimization. In: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. pp. 854–863. AIES '21, Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3461702.3462629
- [10] Quy, T.L., Roy, A., Iosifidis, V., Zhang, W., Ntoutsi, E.: A survey on datasets for fairness-aware machine learning. WIREs Data Mining Knowl. Discov. 12(3) (2022). https://doi.org/10.1002/WIDM.1452
- [11] Schmucker, R., Donini, M., Perrone, V., Archambeau, C.: Multi-objective multi-fidelity hyperparameter optimization with application to fairness (2020)
- [12] Simões, J., Correia, J.: EDCA - an evolutionary data-centric AutoML framework for efficient pipelines. In: García-Sánchez, P., Hart, E., Thomson, S.L. (eds.) Applications of Evolutionary Computation - 28th European Conference, EvoApplications 2025, Held as Part of EvoStar 2025, Trieste, Italy, April 23-25, 2025, Proceedings, Part II. Lecture Notes in Co...
- [13] Suresh, H., Guttag, J.V.: A framework for understanding sources of harm throughout the machine learning life cycle. In: EAAMO 2021: ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, Virtual Event, USA, October 5 - 9, 2021. pp. 17:1–17:9. ACM (2021). https://doi.org/10.1145/3465416.3483305
16 J. Simões and J. Correia
- [14] Weerts, H.J.P., Pfisterer, F., Feurer, M., Eggensperger, K., Bergman, E., Awad, N.H., Vanschoren, J., Pechenizkiy, M., Bischl, B., Hutter, F.: Can fairness be automated? Guidelines and opportunities for fairness-aware AutoML. J. Artif. Intell. Res. 79, 639–677 (2024). https://doi.org/10.1613/JAIR.1.14747
- [15] Whang, S.E., Roh, Y., Song, H., Lee, J.: Data collection and quality challenges in deep learning: a data-centric AI perspective. VLDB J. 32(4), 791–813 (2023). https://doi.org/10.1007/S00778-022-00775-9
- [16] Wu, Q., Wang, C.: Fair AutoML. arXiv preprint arXiv:2111.06495 (2021)
- [17] Zöller, M.A., Huber, M.F.: Benchmark and survey of automated machine learning frameworks. J. Artif. Intell. Res. 70, 409–472 (2021). https://doi.org/10.1613/JAIR.1.11854