OxEnsemble: Fair Ensembles for Low-Data Classification

Chris Russell; Jonathan Rystrøm; Zihao Fu

arXiv: 2512.09665 · v2 · submitted 2025-12-10 · cs.CV · cs.CY · cs.LG


Pith reviewed 2026-05-16 23:23 UTC · model grok-4.3


keywords: fair classification · ensembles · low-data regimes · medical imaging · fairness constraints · data efficiency · prediction aggregation


The pith

OxEnsemble aggregates predictions from fairness-constrained ensemble members to deliver consistent fairness and accuracy in low-data medical imaging classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses fair classification when data is scarce and unbalanced across demographic groups, a common issue in medical imaging where errors carry high costs. It introduces OxEnsemble, which trains multiple models each satisfying fairness constraints and then aggregates their predictions. The method reuses held-out data in a careful way to enforce fairness without needing much extra data or compute beyond fine-tuning a single model. Theoretical guarantees support the approach, and experiments on medical imaging datasets show more consistent results and better fairness-accuracy trade-offs than prior methods.

Core claim

OxEnsemble works by training individual ensemble members to each satisfy fairness constraints and then combining their predictions, while reusing held-out data to enforce those constraints reliably. This construction makes the method both data-efficient and compute-efficient, requiring little more than the cost of fine-tuning or evaluating one model. The paper validates this with new theoretical guarantees and shows experimentally that it produces more consistent outcomes and stronger fairness-accuracy trade-offs than existing approaches across challenging medical imaging datasets.

What carries the argument

OxEnsemble, an ensemble construction in which each member is trained to satisfy fairness constraints before prediction aggregation, with careful held-out data reuse to enforce the constraints.
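
The aggregation step can be illustrated with a minimal sketch. It assumes binary labels and a strict-majority rule with ties resolved to the negative class; the function name `majority_vote` and the toy predictions are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def majority_vote(member_predictions: Sequence[Sequence[int]]) -> list[int]:
    """Combine binary (0/1) predictions from several members by strict majority.

    member_predictions[m][i] is member m's prediction for sample i.
    Ties (possible only with an even number of members) resolve to 0;
    that tie rule is an assumption of this sketch.
    """
    n_members = len(member_predictions)
    n_samples = len(member_predictions[0])
    # Count positive votes per sample across all members.
    votes = [sum(member[i] for member in member_predictions) for i in range(n_samples)]
    return [1 if 2 * v > n_members else 0 for v in votes]


# Toy example: three members, four samples (made-up values).
preds = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
ensemble = majority_vote(preds)
```

An odd number of members avoids ties altogether, which keeps the vote unambiguous.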

If this is right

Fairness can be enforced reliably even when labeled data per group is very limited.
The total compute stays close to that of training or evaluating one model.
Theoretical guarantees ensure the aggregation step preserves fairness properties.
Outcomes become more consistent across runs than single-model baselines.
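
As intuition for why voting can preserve a per-member rate constraint, the classical jury-theorem calculation is sketched below. It assumes fully independent members that are each correct with the same probability `p`, which is an idealisation and stronger than anything this page attributes to the paper; `majority_correct_prob` is a hypothetical helper name.

```python
from math import comb


def majority_correct_prob(p: float, n_members: int) -> float:
    """Probability that a strict majority of n independent members is correct.

    Each member is assumed correct with the same probability p (idealised).
    """
    need = n_members // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_members, k) * p**k * (1 - p) ** (n_members - k)
        for k in range(need, n_members + 1)
    )


# Above 0.5 the vote helps; below 0.5 it hurts (three members).
above = majority_correct_prob(0.6, 3)
below = majority_correct_prob(0.4, 3)
```

Under these assumptions a member-level rate above 0.5 is amplified by the vote and a rate below 0.5 is degraded, which is one way to read the role of the 0.5 recall threshold mentioned in the figure captions.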

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The held-out reuse pattern could apply to enforcing other constraints such as robustness or calibration.
Similar ensembles might improve performance in other high-stakes low-data domains beyond medical imaging.
The approach opens a route to fairness methods that avoid retraining the entire pipeline from scratch.

Load-bearing premise

Training separate models to meet fairness constraints individually and then aggregating their outputs will produce both fairness and accuracy gains without introducing new biases or overfitting from the held-out data reuse step.

What would settle it

Run the method on a medical imaging dataset where the ensemble's fairness metrics or accuracy on a fresh test set fall below those of a single model trained with the same fairness constraint.

Figures

Figures reproduced from arXiv: 2512.09665 by Chris Russell, Jonathan Rystrøm, Zihao Fu.

**Figure 1.** (a) Comparisons. (b) OxEnsemble pipeline. Train (1): Members share backbone and task + protected attributes. Validate (2): Enforce fairness constraint while maximising accuracy. Predict (3): Majority vote. Partitioning ensures full coverage; shared backbone improves efficiency, and voting provides guarantees. Empirically, we demonstrate that OxEnsemble outperforms strong baselines in medical imaging, where …
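
One plausible reading of "partitioning ensures full coverage" is a fold-style split in which every sample is held out by exactly one member and used for training by the others. The sketch below illustrates that reading only; the exact protocol is not specified on this page, and `partition_folds` and its round-robin assignment are assumptions.

```python
def partition_folds(n_samples: int, n_members: int) -> list[dict[str, list[int]]]:
    """Assign each member a disjoint held-out fold; it trains on the remainder.

    Round-robin fold assignment is an assumption of this sketch, chosen so that
    the held-out folds are disjoint and together cover every sample index.
    """
    folds = [list(range(m, n_samples, n_members)) for m in range(n_members)]
    splits = []
    for m in range(n_members):
        train = sorted(i for j, fold in enumerate(folds) if j != m for i in fold)
        splits.append({"train": train, "validate": folds[m]})
    return splits


splits = partition_folds(n_samples=10, n_members=3)
```

With this layout each member enforces its constraint on data it did not train on, while no sample is permanently excluded from training across the ensemble.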

**Figure 2.** Competence Violations vs Recall. Competence violations (C_ρ; 0 = perfect) are high when recall < 0.5 and stabilize at recall > 0.5. Left: Test set for fitting and evaluation. Right: Validation set for fitting, test set for evaluation.

… an ensemble where the members make independent errors. See appendix G.1 for details. This result is consistent with Jury Theorems (Condorcet, 1785; Berend and Paroush, 1998; Kanazaw…

**Figure 3.** Fairness–accuracy AUC (FairAUC) relative to ERM. OxEnsemble achieves higher FairAUC than all baselines on Fitzpatrick17k (left) and HAM10000 (right). Error bars show 95% bootstrap CIs. Evaluation follows § 4 over minimum-recall thresholds in [0.5, 1].

… introduces a regularisation term accounting for the protected attribute and fairness criterion (Buyl et al., 2023), while OxonFair tunes decision thresholds …
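
The 95% bootstrap intervals mentioned in the caption can be sketched with a percentile bootstrap. This is a generic illustration, assuming the statistic is a mean over per-run scores; the resample count, the seed, the function name `bootstrap_ci`, and the sample values are all made up for the example and do not come from the paper.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    stats = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-run scores, for illustration only.
scores = [0.61, 0.64, 0.58, 0.66, 0.63]
low, high = bootstrap_ci(scores)
```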

**Figure 4.** Pareto frontiers across datasets. OxEnsemble (green) yields better fairness–accuracy trade-offs than baselines (grey). Left/centre: min recall (HAM10000, Fitzpatrick17k). Right: equal opportunity (FairVLMed). See § 4 for definitions.
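
A Pareto frontier of the kind plotted here keeps only the operating points that no other point beats on both axes. The sketch below assumes each point is an (accuracy, fairness) pair where higher is better on both; `pareto_frontier` and the example points are hypothetical.

```python
def pareto_frontier(points):
    """Return the non-dominated (accuracy, fairness) points, sorted by accuracy.

    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for i, (acc, fair) in enumerate(points):
        dominated = any(
            (a2 >= acc and f2 >= fair) and (a2 > acc or f2 > fair)
            for j, (a2, f2) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((acc, fair))
    return sorted(frontier)


# Made-up operating points: (accuracy, minimum recall).
candidates = [(0.9, 0.5), (0.8, 0.7), (0.7, 0.6), (0.6, 0.9)]
frontier = pareto_frontier(candidates)
```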

**Figure 5.** Relationship between Ensemble Size (X-axis) and FairAUC (Y-axis) across two datasets. No significant relationship is observed.

Assumption 1 (Competence). Let $W_{\rho,D} \equiv W_\rho(X, Y) = \mathbb{E}_{h \sim \rho, D}[\mathbb{1}(h(X) \neq Y)]$. The ensemble $\rho$ is competent if for every $0 \le t \le 1/2$,

$$P(W_{\rho,D} \in [t, 1/2)) \ge P(W_{\rho,D} \in [1/2, 1 - t]). \tag{9}$$

This assumption can be interpreted as formalising the statement that a majority voting ensemble is more lik…
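
Assumption 1 can be probed empirically by estimating, per sample, the fraction of members that err and then comparing the two probabilities in Eq. 9 on a grid of t values. The sketch below is one such check under stated assumptions: the grid only approximates "for every t", the probabilities are plain sample frequencies, and `error_fractions` and `is_competent` are hypothetical names.

```python
def error_fractions(member_predictions, labels):
    """Per-sample fraction of ensemble members whose prediction is wrong (W)."""
    n_members = len(member_predictions)
    return [
        sum(member[i] != y for member in member_predictions) / n_members
        for i, y in enumerate(labels)
    ]


def is_competent(w, grid_size=51):
    """Check P(W in [t, 1/2)) >= P(W in [1/2, 1 - t]) on a grid of t in [0, 1/2]."""
    n = len(w)
    for step in range(grid_size):
        t = 0.5 * step / (grid_size - 1)
        low = sum(t <= x < 0.5 for x in w) / n        # minority of members wrong
        high = sum(0.5 <= x <= 1 - t for x in w) / n  # majority of members wrong
        if low < high:
            return False
    return True


# Toy data: three members, four samples (made-up values).
preds = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]]
labels = [1, 0, 1, 1]
w = error_fractions(preds, labels)
competent = is_competent(w)
```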

**Figure 6.** Relationship between FairAUC on validation (X-axis) and test set (Y-axis) across ensemble sizes. The relationship is too noisy to guide model selection.

**Figure 7.** Empirical validation of competence proofs. We show that enforcing minimum recall, k > 0.5 + δ, leads to competent ensembles (see § 3.2). δ depends on the data size (§ 3.2.3) and 0.5 comes from our proof in § 3.2.2. The data points above the threshold lie above the X-axis, whereas the points below the threshold fall on both sides.
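
Enforcing a minimum recall on held-out data while maximising accuracy can be sketched as a constrained threshold search. This is a simplification, not the paper's procedure: it assumes one shared decision threshold on scalar scores and requires every group's recall to reach `min_recall`; the helper names and the toy data are hypothetical.

```python
def recall(preds, labels):
    """Recall on the positive class; defined as 1.0 when there are no positives."""
    hits = [p for p, y in zip(preds, labels) if y == 1]
    return sum(hits) / len(hits) if hits else 1.0


def pick_threshold(scores, labels, groups, min_recall):
    """Most accurate threshold whose recall is >= min_recall in every group.

    Returns (threshold, accuracy), or None if no threshold is feasible.
    """
    best = None
    for thr in sorted(set(scores)):
        preds = [int(s >= thr) for s in scores]
        group_recalls = []
        for g in set(groups):
            idx = [i for i, gi in enumerate(groups) if gi == g]
            group_recalls.append(recall([preds[i] for i in idx], [labels[i] for i in idx]))
        if min(group_recalls) < min_recall:
            continue  # constraint violated for at least one group
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if best is None or acc > best[1]:
            best = (thr, acc)
    return best


# Made-up held-out scores for two groups.
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
choice = pick_threshold(scores, labels, groups, min_recall=0.6)
```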

read the original abstract

We address the problem of fair classification in settings where data is scarce and unbalanced across demographic groups. Such low-data regimes are common in domains like medical imaging, where false negatives can have fatal consequences. We propose a novel approach, OxEnsemble, for efficiently training ensembles and enforcing fairness in these low-data regimes. Unlike other approaches, we aggregate predictions across ensemble members, each trained to satisfy fairness constraints. By construction, OxEnsemble is both data-efficient (carefully reusing held-out data to enforce fairness reliably) and compute-efficient, requiring little more compute than used to fine-tune or evaluate an existing model. We validate this approach with new theoretical guarantees. Experimentally, our approach yields more consistent outcomes and stronger fairness-accuracy trade-offs than existing methods across multiple challenging medical imaging classification datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OxEnsemble claims an efficient ensemble construction for fairness in low-data medical imaging by per-member constraints plus held-out data reuse, but the abstract leaves the reuse rule and guarantees too vague to judge if it actually works.


The paper's core idea is to train each ensemble member to satisfy fairness constraints individually, then aggregate the predictions, while reusing held-out data to enforce those constraints without much extra cost. It positions this as both data-efficient and compute-efficient compared with other fairness approaches, and it reports better consistency and fairness-accuracy trade-offs on medical imaging datasets. That construction is the main novelty on offer: a specific way to bake fairness into the ensemble members rather than post-processing or adding heavy regularizers.

Referee Report

3 major / 2 minor

Summary. The paper proposes OxEnsemble, an ensemble approach for fair classification in low-data, unbalanced regimes (e.g., medical imaging). Each ensemble member is trained to satisfy fairness constraints before predictions are aggregated. The method is claimed to be both data-efficient (via careful reuse of held-out data) and compute-efficient (requiring little more than fine-tuning or evaluating an existing model), supported by new theoretical guarantees. Experiments on multiple medical imaging datasets reportedly show more consistent outcomes and improved fairness-accuracy trade-offs compared to existing methods.

Significance. If the held-out reuse protocol demonstrably avoids leakage and the theoretical guarantees hold under realistic low-data assumptions (small per-group sample sizes), the approach would offer a practical, low-overhead route to fairness in safety-critical domains where data scarcity is common. The emphasis on compute efficiency relative to single-model fine-tuning is a notable strength if substantiated.

major comments (3)

[Abstract and Section 3] Abstract and Methods (Section 3): The central claim that data-efficiency arises 'by construction' from held-out data reuse for fairness enforcement lacks an explicit reuse protocol (e.g., whether fairness is evaluated on the same validation fold or a disjoint subset). This detail is load-bearing for the no-leakage and reliability assertions, especially given the targeted regime of n ≪ 100 per demographic group where reuse can induce optimistic bias.
[Section 4] Theoretical Guarantees (Section 4): The new theoretical guarantees are asserted without a clear statement of the assumptions required for their validity (independence of ensemble members, bounded variance of fairness metrics, or minimum group-size conditions). Without these, it is difficult to evaluate applicability to the low-data medical imaging setting described.
[Section 5] Experiments (Section 5): The reported superiority in fairness-accuracy trade-offs and consistency is presented without sufficient error analysis, ablation on the reuse mechanism, or details on how held-out data is partitioned across the ensemble, making it hard to confirm that gains are not artifacts of the reuse strategy.

minor comments (2)

[Section 3] Notation for fairness metrics and ensemble aggregation should be introduced with explicit equations early in the methods section for clarity.
[Section 5] Figure captions and table legends would benefit from more detail on dataset characteristics (e.g., exact per-group sample sizes) to allow readers to assess the low-data regime.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and will incorporate revisions to improve clarity, rigor, and completeness while preserving the core contributions of the work.


Referee: [Abstract and Section 3] Abstract and Methods (Section 3): The central claim that data-efficiency arises 'by construction' from held-out data reuse for fairness enforcement lacks an explicit reuse protocol (e.g., whether fairness is evaluated on the same validation fold or a disjoint subset). This detail is load-bearing for the no-leakage and reliability assertions, especially given the targeted regime of n ≪ 100 per demographic group where reuse can induce optimistic bias.

Authors: We agree that the reuse protocol requires greater explicitness to support the no-leakage claim. The current manuscript describes careful reuse of held-out data in Section 3, but does not fully detail the partitioning. In the revision we will add a precise protocol: fairness constraints are enforced on a disjoint subset of the held-out data that is never used for training or primary validation of any ensemble member. We will include pseudocode and a small diagram illustrating the partitioning to eliminate any possibility of optimistic bias in the n ≪ 100 regime.
revision: yes

Referee: [Section 4] Theoretical Guarantees (Section 4): The new theoretical guarantees are asserted without a clear statement of the assumptions required for their validity (independence of ensemble members, bounded variance of fairness metrics, or minimum group-size conditions). Without these, it is difficult to evaluate applicability to the low-data medical imaging setting described.

Authors: We will expand Section 4 with a dedicated assumptions paragraph. The guarantees rely on (i) ensemble members being trained from different random initializations yielding approximate independence, (ii) fairness metrics satisfying standard bounded-variance conditions via concentration inequalities, and (iii) per-group sample sizes n_g ≥ 20 to ensure the derived bounds remain meaningful. We will also add a short discussion of how these assumptions align with the medical imaging datasets used in the experiments.
revision: yes

Referee: [Section 5] Experiments (Section 5): The reported superiority in fairness-accuracy trade-offs and consistency is presented without sufficient error analysis, ablation on the reuse mechanism, or details on how held-out data is partitioned across the ensemble, making it hard to confirm that gains are not artifacts of the reuse strategy.

Authors: We accept that additional experimental rigor is warranted. The revised Section 5 will report mean and standard deviation over five independent runs together with paired statistical significance tests. We will also add an ablation study that isolates the reuse mechanism (comparing disjoint vs. overlapping fairness evaluation) and will explicitly document the partitioning scheme used for each ensemble member. These additions will demonstrate that the observed improvements are robust rather than artifacts of the reuse strategy.
revision: yes

Circularity Check

0 steps flagged

No circularity: efficiency and guarantees presented as design properties with independent theoretical support

full rationale

The abstract states that OxEnsemble is data-efficient and compute-efficient 'by construction' via held-out data reuse and aggregation of fairness-constrained members, then validates this with 'new theoretical guarantees.' No equations, parameter fits, or self-citations are shown that reduce the claimed guarantees or efficiency to the inputs themselves. The reuse mechanism is described as a deliberate protocol rather than a self-defining loop, and the theoretical guarantees are positioned as external validation rather than derived tautologically from the method's definition. This is a standard design claim without the specific reductions required for circularity flags.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; central claims rest on unstated details of fairness constraint enforcement and data reuse validity.

axioms (1)

domain assumption: Held-out data can be reused to enforce fairness constraints per ensemble member without invalidating the overall guarantees or introducing selection bias.
Invoked in the data-efficiency claim of the abstract.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem.
We propose a novel approach OxEnsemble for efficiently training ensembles and enforcing fairness in these low-data regimes... each trained to satisfy fairness constraints... majority vote.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.
We prove that our fair ensembles are guaranteed to preserve fairness under both error-parity and minimum rate constraints...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · 1 internal anchor

[1]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

doi: 10.1117/1.JMI.10.6.061104. URL https://www.spiedigitallibrary.org/journals/journal-of-medical-imaging/volume-10/issue-06/061104/Toward-fairness-in-artificial-intelligence-for-medical-image-analysis/10.1117/1.JMI.10.6.061104.full. Raman Dutt, Ondrej Bohdal, Sotirios A. Tsaftaris, and Timothy Hospedales. FairTune: Optimizing parameter efficient fine...

doi:10.1117/1.jmi.10.6.061104 · 2023
[2]

In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

doi: 10.1109/CVPRW53098.2021.00201. URL https://ieeexplore.ieee.org/document/9522867. Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://papers.nips.cc/paper_files/paper/2016/hash/9d2682367c3935defcb1f9e2...

doi:10.1109/cvprw53098.2021.00201 · 2021
[3]

doi: 10.1088/2516-1091/ad525b

ISSN 2516-1091. doi: 10.1088/2516-1091/ad525b. Marcus Pivato. Epistemic democracy with correlated voters. Journal of Mathematical Economics, 72:51–69, October 2017. ISSN 0304-4068. doi: 10.1016/j.jmateco.2017.06.001. URL https://www.sciencedirect.com/science/article/pii/S0304406816301094. María Agustina Ricci Lara, Candelaria Mosquera, Enzo Ferrante, and Ro...

doi:10.1088/2516-1091/ad525b · 2017
[4]

Data Access and Information We provide links for accessing the data in Table 3

Link: https://github.com/jhrystrom/guaranteed-fair-ensemble

Appendix B. Data Access and Information. We provide links for accessing the data in Table 3. While all data is openly available for academic research, some of it requires approval by the providers. For detailed summary statistics for HAM10000 and Fitzpatrick17k, see the supple...

doi:10.7910/dvn/dbw86t · 2022
[5]

How does ensemble size affect performance?

… considers a fixed distribution $D = (X, Y)$, which they frequently drop from their notation; we preserve it as we will want to vary $D$. Their results are as follows. The ensemble improvement rate is defined as:

$$\mathrm{EIR}_D = \frac{\mathbb{E}_{h \sim \rho}[L_D(h)] - L_D(h_{\mathrm{MV}})}{\mathbb{E}_{h \sim \rho}[L_D(h)]} \tag{7}$$

and the Disagreement-Error Ratio as:

$$\mathrm{DER}_D = \frac{\mathbb{E}_{h, h' \sim \rho}[D_D(h, h')]}{\mathbb{E}_{h \sim \rho}[L_D(h)]} \tag{8}$$

where $L_D(h)$ is the error ...

2023
[6]

Negative flips decrease probabilities (given by Lemma 3): Given a subset p of ensemble models taking positive labels, with their complement taking negative labels, flipping some of p so they also take negative labels to obtain a new subset q will result in q having a lower probability of occurring than p.
[7]

17 determine

Matching ps and qs (given by Lemma 4): It is possible to identify matching pairs of such p of size s and q of size |ρ| − s in equation Eq. 17 determine.

Lemma 3. If p ⊇ q, the following inequality holds for their associated summands:

$$I_p \ge I_q \tag{12}$$

Proof. To see this, we write n = p̄ for the members of the ensemble that take a negative label in both p an...

1976
