pith. sign in

arxiv: 2604.10224 · v1 · submitted 2026-04-11 · 💻 cs.LG · cs.AI

Exploring the impact of fairness-aware criteria in AutoML

Pith reviewed 2026-05-10 15:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords AutoMLfairnessmachine learning pipelinesoptimizationpredictive performancedata usagemodel complexity
0
0 comments X

The pith

Integrating fairness criteria into full AutoML pipeline optimization improves fairness and reduces data use at a modest accuracy cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests what changes when fairness metrics are added directly to the optimization step inside an AutoML system that builds complete pipelines, including data selection and transformations. Rather than optimizing solely for predictive accuracy and adding fairness only later in model selection, the system now balances accuracy against several complementary fairness measures throughout the process. On the tested tasks this produced pipelines that were on average 14.5 percent fairer, used 35.7 percent less data, and ended up simpler, even though predictive performance fell by 9.4 percent. A reader would care because AutoML is now routinely applied to decisions that affect people, and the results suggest fairness can be achieved without requiring more complex or data-heavy solutions.

Core claim

By incorporating complementary fairness metrics directly into the optimisation component of an AutoML framework that constructs complete ML pipelines from data selection and transformations to model selection and tuning, the resulting solutions show a 9.4 percent decrease in predictive power accompanied by a 14.5 percent improvement in average fairness, a 35.7 percent reduction in data usage, and complete yet simpler final models compared with a performance-only baseline.

What carries the argument

The optimisation component of the AutoML framework, extended to jointly optimise predictive performance and multiple complementary fairness metrics across every stage of the pipeline.

If this is right

  • Fairness integration across the full pipeline produces complete solutions that are simpler than those from performance-only optimization.
  • Lower data usage can accompany fairness improvements, indicating more efficient pipelines are possible.
  • Model complexity is not required to reach balanced fairness outcomes.
  • Addressing fairness only at the model-selection stage misses gains available from earlier pipeline choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint-optimization approach could be applied inside other AutoML libraries to check whether the trade-offs remain stable.
  • Simpler fair pipelines may be easier to inspect and maintain when deployed in regulated settings.
  • Testing whether the fairness gains persist when the definition of fairness itself changes across contexts would be a direct next measurement.

Load-bearing premise

The complementary fairness metrics chosen for the experiments adequately capture the relevant dimensions of fairness for the tasks, and the observed improvements hold beyond the specific datasets and AutoML framework tested.

What would settle it

Repeating the AutoML runs on new datasets with alternative fairness metrics and finding no consistent gain in fairness scores or data reduction would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.10224 by Joana Sim\~oes, Jo\~ao Correia.

Figure 1
Figure 1. Figure 1: Relationship between fairness (average of DP [4], EO [6], ABROCA [5]) [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of the data selection techniques selected on the final solutions [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Frequency of each feature on the final solutions. It shows how many times [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of the classification models selected on the final solutions [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Evolution of the performance and fairness metrics used in the Fitness [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
read the original abstract

Machine Learning (ML) systems are increasingly used to support decision-making processes that affect individuals. However, these systems often rely on biased data, which can lead to unfair outcomes against specific groups. With the growing adoption of Automated Machine Learning (AutoML), the risk of intensifying discriminatory behaviours increases, as most frameworks primarily focus on model selection to maximise predictive performance. Previous research on fairness in AutoML had largely followed this trend, integrating fairness awareness only in the model selection or hyperparameter tuning, while neglecting other critical stages of the ML pipeline. This paper aims to study the impact of integrating fairness directly into the optimisation component of an AutoML framework that constructs complete ML pipelines, from data selection and transformations to model selection and tuning. As selecting appropriate fairness metrics remains a key challenge, our work incorporates complementary fairness metrics to capture different dimensions of fairness during the optimisation. Their integration within AutoML resulted in measurable differences compared to a baseline focused solely on predictive performance. Despite a 9.4% decrease in predictive power, the average fairness improved by 14.5%, accompanied by a 35.7% reduction in data usage. Furthermore, fairness integration produced complete yet simpler final solutions, suggesting that model complexity is not always required to achieve balanced and fair ML solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper examines the effects of embedding fairness-aware criteria directly into the optimization objective of an AutoML framework that searches over complete ML pipelines (data selection, transformations, model selection, and tuning). Using complementary fairness metrics, the authors report that fairness integration produces pipelines with a 9.4% drop in predictive performance, a 14.5% gain in average fairness, and a 35.7% reduction in data usage relative to a performance-only baseline, while also yielding simpler yet complete solutions.

Significance. If the quantitative trade-offs prove robust, the work is significant for demonstrating that fairness can be incorporated at the full-pipeline level in AutoML rather than only at model selection, potentially enabling more efficient and interpretable fair systems. The observation that fairness integration can reduce data usage and model complexity is a concrete, practically relevant finding that extends prior fairness-in-AutoML literature.

major comments (2)
  1. [Abstract / Experimental Results] Abstract and Experimental Results: The headline percentages (9.4% predictive-power decrease, 14.5% fairness improvement, 35.7% data-usage reduction) are presented without accompanying information on the datasets, number of runs, statistical significance tests, variance across replications, or the exact aggregation method used to combine the complementary fairness metrics into the single optimization objective. These omissions make it impossible to determine whether the reported deltas are reliable or sensitive to the particular experimental choices.
  2. [Experimental Results] Experimental Results: No sensitivity analysis is described that substitutes alternative fairness metrics (e.g., replacing one parity-based measure with an equalized-odds variant) or evaluates the same pipeline search on additional tabular datasets. Because the central claim rests on the adequacy of the chosen complementary metrics and their generalization, the absence of such checks leaves the broader conclusion that fairness integration “reliably produces complete yet simpler fair solutions” unsupported.
minor comments (1)
  1. [Abstract] The abstract would benefit from naming the specific AutoML framework and the concrete fairness metrics employed, even at a high level, to allow readers to situate the results immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, providing clarifications from the manuscript and committing to revisions that improve the presentation and robustness of the results.

read point-by-point responses
  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results: The headline percentages (9.4% predictive-power decrease, 14.5% fairness improvement, 35.7% data-usage reduction) are presented without accompanying information on the datasets, number of runs, statistical significance tests, variance across replications, or the exact aggregation method used to combine the complementary fairness metrics into the single optimization objective. These omissions make it impossible to determine whether the reported deltas are reliable or sensitive to the particular experimental choices.

    Authors: We agree that the abstract would benefit from including these details to allow immediate assessment of reliability. The experimental section of the manuscript specifies the datasets (standard tabular benchmarks including Adult, COMPAS, and German Credit), the use of 10 independent replications with different random seeds, reporting of means with standard deviations for variance, paired t-tests for statistical significance, and the aggregation method (normalized average of the complementary fairness metrics to form a single scalar objective). To address the referee's concern directly, we will revise the abstract to concisely include the experimental setup, replication count, variance reporting, and aggregation approach, while ensuring the results section explicitly highlights significance tests. revision: yes

  2. Referee: [Experimental Results] Experimental Results: No sensitivity analysis is described that substitutes alternative fairness metrics (e.g., replacing one parity-based measure with an equalized-odds variant) or evaluates the same pipeline search on additional tabular datasets. Because the central claim rests on the adequacy of the chosen complementary metrics and their generalization, the absence of such checks leaves the broader conclusion that fairness integration “reliably produces complete yet simpler fair solutions” unsupported.

    Authors: We acknowledge that the current manuscript does not include dedicated sensitivity analyses for alternative metric substitutions or additional datasets beyond those reported. Our experiments focus on a core set of complementary metrics and representative tabular datasets to demonstrate the pipeline-level effects. We agree this limits the strength of the generalization claim. We will add a new sensitivity analysis subsection to the Experimental Results, incorporating (i) substitutions with variants such as equalized odds and (ii) evaluations on two further tabular datasets. Updated results and discussion will be included to show whether the reported improvements in fairness, data reduction, and model simplicity remain consistent. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparisons with no derivation chain

full rationale

The paper reports experimental results from integrating complementary fairness metrics into an AutoML pipeline search and measuring deltas versus a predictive-performance baseline (9.4% predictive drop, 14.5% fairness gain, 35.7% data reduction). No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described methodology. The central claims rest on direct, falsifiable runs on chosen datasets rather than any reduction to inputs by construction. This is the expected non-finding for a purely empirical AutoML fairness study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is purely empirical and does not rely on mathematical derivations or new theoretical entities; it assumes standard definitions of the fairness metrics and the AutoML search procedure.

pith-pipeline@v0.9.0 · 5524 in / 1111 out tokens · 39150 ms · 2026-05-10T15:57:45.107117+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  1. [1]

    doi: 10.1145/3616865

    Caton, S., Haas, C.: Fairness in machine learning: A survey. ACM Comput. Surv. 56(7), 166:1–166:38 (2024).https://doi.org/10.1145/3616865

  2. [2]

    BMC genomics 21(1), 6 (2020)

    Chicco, D., Jurman, G.: The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC genomics 21(1), 6 (2020)

  3. [3]

    In: 2021 IEEE international conference on data mining (ICDM)

    Cruz, A.F., Saleiro, P., Belém, C., Soares, C., Bizarro, P.: Promoting fairness through hyperparameter optimization. In: 2021 IEEE international conference on data mining (ICDM). pp. 1036–1041. IEEE (2021)

  4. [4]

    In: Proceedings of the 3rd Innovations in Theoretica l Computer Science Conference On - ITCS ’12, pp

    Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: Goldwasser, S. (ed.) Innovations in Theoretical Computer Science 2012,Cambridge,MA,USA,January8-10,2012.pp.214–226.ACM(2012).https: //doi.org/10.1145/2090236.2090255

  5. [5]

    Gardner, J., Brooks, C., Baker, R.: Evaluating the fairness of predictive student modelsthroughslicinganalysis.In:Proceedingsofthe9thInternationalConference on Learning Analytics and Knowledge, LAK 2019, Tempe, AZ, USA, March 4-8,

  6. [6]

    pp. 225–234. ACM (2019).https://doi.org/10.1145/3303772.3303791

  7. [7]

    Advances in neural information processing systems29(2016)

    Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. Advances in neural information processing systems29(2016)

  8. [8]

    Mangal, M., Pardos, Z.A.: Implementing equitable and intersectionality-aware ml in education: A practical guide. Br. J. Educ. Technol.55(5), 2003–2038 (2024). https://doi.org/10.1111/BJET.13484

  9. [9]

    ACM Comput

    Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A.: A survey on bias and fairness in machine learning. ACM Comput. Surv.54(6), 115:1–115:35 (2022).https://doi.org/10.1145/3457607

  10. [10]

    In: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society

    Perrone,V.,Donini,M.,Zafar,M.B.,Schmucker,R.,Kenthapadi,K.,Archambeau, C.: Fair bayesian optimization. In: Proceedings of the 2021 AAAI/ACM Confer- ence on AI, Ethics, and Society. p. 854–863. AIES ’21, Association for Comput- ing Machinery, New York, NY, USA (2021).https://doi.org/10.1145/3461702. 3462629

  11. [11]

    WIREs Data Mining Knowl

    Quy, T.L., Roy, A., Iosifidis, V., Zhang, W., Ntoutsi, E.: A survey on datasets for fairness-aware machine learning. WIREs Data Mining Knowl. Discov.12(3) (2022).https://doi.org/10.1002/WIDM.1452

  12. [12]

    Schmucker, R., Donini, M., Perrone, V., Archambeau, C.: Multi-objective multi- fidelity hyperparameter optimization with application to fairness (2020)

  13. [13]

    In: García-Sánchez, P., Hart, E., Thomson, S.L

    Simões, J., Correia, J.: EDCA - an evolutionary data-centric automl framework for efficient pipelines. In: García-Sánchez, P., Hart, E., Thomson, S.L. (eds.) Applica- tions of Evolutionary Computation - 28th European Conference, EvoApplications 2025, Held as Part of EvoStar 2025, Trieste, Italy, April 23-25, 2025, Proceedings, Part II. Lecture Notes in Co...

  14. [14]

    In: EAAMO 2021: ACM Conference on Eq- uity and Access in Algorithms, Mechanisms, and Optimization, Virtual Event, USA, October 5 - 9, 2021

    Suresh, H., Guttag, J.V.: A framework for understanding sources of harm through- out the machine learning life cycle. In: EAAMO 2021: ACM Conference on Eq- uity and Access in Algorithms, Mechanisms, and Optimization, Virtual Event, USA, October 5 - 9, 2021. pp. 17:1–17:9. ACM (2021).https://doi.org/10.1145/ 3465416.3483305 16 J. Simões and J. Correia

  15. [15]

    Weerts, H.J.P., Pfisterer, F., Feurer, M., Eggensperger, K., Bergman, E., Awad, N.H., Vanschoren, J., Pechenizkiy, M., Bischl, B., Hutter, F.: Can fairness be au- tomated? guidelines and opportunities for fairness-aware automl. J. Artif. Intell. Res.79, 639–677 (2024).https://doi.org/10.1613/JAIR.1.14747

  16. [16]

    VLDB J.32(4), 791–813 (2023)

    Whang, S.E., Roh, Y., Song, H., Lee, J.: Data collection and quality challenges in deep learning: a data-centric AI perspective. VLDB J.32(4), 791–813 (2023). https://doi.org/10.1007/S00778-022-00775-9

  17. [17]

    In: ArXiv preprint arXiv:2111.06495 (2021)

    Wu, Q., Wang, C.: Fair automl. In: ArXiv preprint arXiv:2111.06495 (2021)

  18. [18]

    Zöller, M.A., Huber, M.F.: Benchmark and survey of automated machine learning frameworks. J. Artif. Intell. Res.70, 409–472 (2021).https://doi.org/10.1613/ JAIR.1.11854