
arxiv: 2605.14260 · v2 · pith:IBE4NJSLnew · submitted 2026-05-14 · 📊 stat.ML · cs.LG

On the Burden of Achieving Fairness in Conformal Prediction

Pith reviewed 2026-05-19 17:03 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: conformal prediction, fairness, coverage distortion, quantile heterogeneity, prediction sets, group calibration, equalized coverage, set size

The pith

Pooled calibration in conformal prediction creates irreducible group coverage distortion scaled by quantile heterogeneity and places equal coverage in direct conflict with equal set size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes through population-level analysis that using a single shared threshold for conformal calibration produces group-wise coverage that deviates from the nominal level in a way no choice of threshold can eliminate, with the deviation size fixed by how much the groups differ in their score quantiles. It further demonstrates that the two standard fairness targets cannot be satisfied together: any policy that equalizes coverage across groups necessarily produces unequal average set sizes, and vice versa. This matters because conformal methods are widely adopted to deliver distribution-free coverage guarantees in machine learning, and the results clarify that fairness interventions merely relocate the cross-group heterogeneity rather than remove it. Experiments on both synthetic distributions and real datasets show the same trade-off appears after finite-sample calibration.

Core claim

Through analysis of the underlying score distributions, the authors derive a conservation law and lower bound showing that pooled calibration incurs irreducible group-wise coverage distortion whose scale is set by cross-group quantile heterogeneity. They establish that the fairness definitions of Equalized Coverage and Equalized Set Size are fundamentally in tension, so that calibration policy choice determines whether the resulting distortion appears in the coverage dimension or the set-size dimension.

What carries the argument

The conservation law and lower bound on coverage distortion induced by cross-group quantile heterogeneity in the population score distributions.
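
A minimal population-level sketch of the pooled-threshold mechanism, under stated assumptions: two hypothetical groups with Gaussian score distributions and made-up weights (neither is taken from the paper). It relies only on the fact that the mixture CDF equals 1 − α at the pooled quantile, so the weighted signed group distortions cancel. This is presumably the additive conservation identity the summary refers to, but the paper's exact statement is not reproduced here.

```python
from statistics import NormalDist

ALPHA = 0.1

# Hypothetical setup (not from the paper): group weights p_g and Gaussian score laws.
weights = [0.7, 0.3]
groups = [NormalDist(mu=0.0, sigma=1.0), NormalDist(mu=0.8, sigma=1.5)]


def mixture_cdf(t):
    """CDF of the pooled (mixture) score distribution."""
    return sum(p * g.cdf(t) for p, g in zip(weights, groups))


def pooled_quantile(level, lo=-50.0, hi=50.0, iters=200):
    """Solve mixture_cdf(q) = level by bisection (the mixture CDF is continuous)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


q_pooled = pooled_quantile(1 - ALPHA)
group_quantiles = [g.inv_cdf(1 - ALPHA) for g in groups]

# Signed coverage distortion of each group at the pooled threshold.
distortion = [g.cdf(q_pooled) - (1 - ALPHA) for g in groups]
weighted_sum = sum(p * d for p, d in zip(weights, distortion))

print("group quantiles:", group_quantiles)
print("pooled quantile:", q_pooled)
print("signed distortions:", distortion)
print("weighted sum of distortions:", weighted_sum)
```

In this toy setup the group with the smaller quantile is over-covered and the other is under-covered, and the two effects offset exactly once weighted by group mass.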

If this is right

  • Calibration policy determines the dimension in which cross-group heterogeneity manifests, either coverage or set size.
  • Switching from pooled to group-specific calibration removes coverage distortion but changes relative set sizes across groups.
  • The bidirectional trade-off between the two fairness criteria remains visible after finite-sample calibration on both synthetic and real data.
  • No policy within the studied families eliminates cross-group heterogeneity; it only relocates the distortion.
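
The relocation described in these bullets can be sketched with a small finite-sample simulation. Everything here is an assumption for illustration: the toy softmax "classifier", the signal strengths for the two groups, the sample sizes, and the use of the simple score 1 − p(y|x). None of it is the paper's experimental setup.

```python
import numpy as np

ALPHA = 0.1
N_CLASSES = 10
rng = np.random.default_rng(0)


def simulate_group(n, signal):
    """Toy classifier: softmax over noisy logits; a larger signal means an easier group."""
    labels = rng.integers(0, N_CLASSES, size=n)
    logits = rng.normal(size=(n, N_CLASSES))
    logits[np.arange(n), labels] += signal
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs, labels


def true_label_scores(probs, labels):
    """Simple nonconformity score: one minus the probability of the true label."""
    return 1.0 - probs[np.arange(len(labels)), labels]


def conformal_threshold(cal_scores, alpha):
    """Split conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(cal_scores)[k - 1]


def evaluate(probs, labels, q):
    """Return (empirical coverage, mean prediction-set size) at threshold q."""
    covered = true_label_scores(probs, labels) <= q
    set_sizes = ((1.0 - probs) <= q).sum(axis=1)
    return covered.mean(), set_sizes.mean()


signals = {"easy": 3.0, "hard": 1.5}  # hypothetical group difficulty
cal = {g: simulate_group(2000, s) for g, s in signals.items()}
test = {g: simulate_group(2000, s) for g, s in signals.items()}

cal_scores = {g: true_label_scores(*cal[g]) for g in signals}
q_pooled = conformal_threshold(np.concatenate(list(cal_scores.values())), ALPHA)
q_group = {g: conformal_threshold(cal_scores[g], ALPHA) for g in signals}

results = {}
for g in signals:
    results[g] = {
        "pooled": evaluate(*test[g], q_pooled),
        "groupwise": evaluate(*test[g], q_group[g]),
    }
    print(g, "pooled (coverage, size):", results[g]["pooled"],
          "| group-wise (coverage, size):", results[g]["groupwise"])
```

Because prediction sets are nested in the threshold, a group whose own threshold exceeds the pooled one can only gain set size when calibration switches to group-wise thresholds, and a group whose threshold is lower can only lose it. The sketch shows the direction of the trade-off, not its magnitude in the paper's experiments.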

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners must explicitly choose which dimension to equalize based on whether coverage reliability or prediction-set size matters more for their use case.
  • The lower bound supplies a concrete way to measure the minimum fairness cost imposed by a given dataset's group score differences before any calibration is chosen.
  • The same conservation structure could be used to evaluate fairness trade-offs in other distribution-free uncertainty quantification procedures.

Load-bearing premise

The results rest on the assumption that population-level score distributions capture all relevant cross-group heterogeneity, and that the derived bounds carry over to finite-sample calibration without material additional distortion from the sampling variability of empirical quantiles.

What would settle it

A dataset in which pooled calibration is performed and the measured group-wise coverage deviations fall below the lower bound predicted from the groups' quantile differences, or in which both equal coverage and equal set sizes are observed simultaneously with no measurable trade-off.

Figures

Figures reproduced from arXiv: 2605.14260 by Archer Yi Yang, Jesse C. Cresswell, Masoud Asgharian, Mouloud Belbahri, Pengqi Liu, Ziang Gao.

Figure 1: Bidirectional policy conversion in the synthetic study. Panels A–B illustrate the coverage [PITH_FULL_IMAGE:figures/full_fig_p007_1.png]
Figure 2: Two-group Gaussian pooled-threshold [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]
Figure 3: Bias in Bios mechanism view at α = 0.1 for the simple score. Panel A illustrates the pooled-threshold mechanism in Theorem 1; Panels B–C illustrate Theorems 3–4 and Corollaries 1–2; Panel D summarizes the three distortions (Theorem 2, Corollaries 1–2) for male and female groups [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]
Figure 4: MultiNLI at α = 0.1 with simple (left) and RAPS (right) scores. For each score, Panel A shows signed coverage distortion $\hat{F}_{S|g}(\hat{q}) - (1 - \alpha)$ under pooled threshold (Theorem 1): positive bars indicate over-coverage and negative bars indicate under-coverage. Panel B shows the signed change in expected set size $\hat{\ell}_g(\hat{q}_g) - \hat{\ell}_g(\hat{q})$ after switching to group-wise thresholds that equalize coverage (Corollary 1). …
Figure 5: FACET at α = 0.1 with the RAPS score; Panel A illustrates Theorem 1, and Panels B–C illustrate Corollaries 1–2. We next show the same mechanisms on FACET [13] using the RAPS score on the age group split (Younger, Middle, Older, Unknown) with a zero-shot CLIP ViT-L/14 classifier [19] [PITH_FULL_IMAGE:figures/full_fig_p009_5.png]
Figure 6: Four multi-group families. Across all four score families, the empirical RMS miscoverage [PITH_FULL_IMAGE:figures/full_fig_p022_6.png]
Figure 7: Imbalanced four-group pooled-threshold diagnostics. Across all four families, the weighted [PITH_FULL_IMAGE:figures/full_fig_p024_7.png]
Figure 8: Bias in Bios mechanism view at α = 0.10 for SAPS score. Panel A illustrates the pooled-threshold mechanism in Theorem 1; Panels B–C illustrate Theorems 3–4 and Corollaries 1–2; Panel D summarizes the three distortions for male and female groups.
[Plot residue, Panel A "Calibration ECDFs and thresholds": x-axis true-label RAPS nonconformity score; y-axis empirical CDF; legend Male, Female, Pooled threshold, Tar…]
Figure 9: Bias in Bios mechanism view at α = 0.10 for RAPS score. Panel A illustrates the pooled-threshold mechanism in Theorem 1; Panels B–C illustrate Theorems 3–4 and Corollaries 1–2; Panel D summarizes the three distortions for male and female groups. [PITH_FULL_IMAGE:figures/full_fig_p026_9.png]
Figure 10: Finite-calibration detectability diagnostics of the pooled-threshold floor from Theorem [PITH_FULL_IMAGE:figures/full_fig_p030_10.png]
Figure 11
Figure 11 shows that across α ∈ {0.05, 0.07, 0.085, 0.10}, the empirical pooled-threshold distortion stays above the estimated lower-bound scale, while the induced size and coverage distortions remain nonzero throughout. E.3 Controlled Genre Temperature Sweep. At α = 0.1, we perturb only the facetoface genre via temperature scaling [PITH_FULL_IMAGE:figures/full_fig_p031_11.png]
Figure 12: Controlled MultiNLI temperature sweep at α = 0.10 for the simple score, perturbing the facetoface genre only. Panel A corresponds to Theorem 2. Panels B–C show Corollaries 1–2. The same trade-off mechanism remains visible across the temperature sweep.
[Plot residue, Panel A "Pooled threshold": axis "Coverage distortion"; genres Government, Oup, Verbatim, Travel, Letters, Facetoface, Fiction, Telephone, Slate, Nineeleven; …]
Figure 13: MultiNLI at α = 0.10 using SAPS score. For this score, Panel A shows pooled quantile consequence of Theorem 1; Panels B and C illustrate the set size distortion in Corollary 1, and the coverage distortion in Corollary 2, respectively. [PITH_FULL_IMAGE:figures/full_fig_p033_13.png]
Figure 14: MultiNLI robustness across α for SAPS (top) and RAPS (bottom) scores. In each row, Panel A is best read as a finite-sample diagnostic for Theorem 2 based on an estimated lower-bound proxy, rather than a pointwise lower-bound verification. Panels B–C show Corollaries 1–2. The induced set-size and coverage distortions remain visible across the tested α-grid. [PITH_FULL_IMAGE:figures/full_fig_p034_14.png]
Figure 15: Controlled MultiNLI temperature sweep at α = 0.10 for SAPS (top) and RAPS (bottom) scores, perturbing the facetoface genre only. Panel A is best read as a finite-sample diagnostic for Theorem 2. Panels B–C illustrate Corollaries 1–2. The same trade-off mechanism remains visible across the temperature sweep.
[Plot residue: x-axis calibration size; y-axis SNR = true floor / sd(empiri…]
Figure 16: Finite-calibration detectability of the pooled-threshold floor from Theorem [PITH_FULL_IMAGE:figures/full_fig_p035_16.png]
Figure 17
Figure 17 shows that for various target levels α ∈ {0.05, 0.07, 0.085, 0.10}, the empirical pooled-threshold distortion stays above the empirical lower bound. The induced set size distortion remains nonzero throughout and the equalized expected set size policy continues to produce a nonzero RMS coverage distortion. Next, we perturb only the Younger group through temperature scaling while keeping the rest of the gro…
Figure 18: Controlled FACET temperature sweep at α = 0.10 for the RAPS score, perturbing the Younger group only. Panel A evaluates the lower-bound behavior in Theorem 2. Panels B–C show Corollaries 1–2. The same trade-off mechanism remains visible across the temperature sweep. [PITH_FULL_IMAGE:figures/full_fig_p036_18.png]
Original abstract

Conformal prediction is often calibrated with a single pooled threshold, but this can hide cross-group heterogeneity in score distributions and distort group-wise coverage. We study this phenomenon through the population score distributions underlying split conformal calibration. First, we derive a conservation law and lower bound showing that pooled calibration incurs irreducible group-wise coverage distortion at a scale set by cross-group quantile heterogeneity. Second, we demonstrate that the two leading fairness definitions for conformal prediction, Equalized Coverage and Equalized Set Size, are fundamentally in tension. Third, we quantify the cost of moving between policies which treat groups separately or pool them. Experiments on synthetic and real data confirm the same bidirectional trade-off after finite-sample calibration. Our results show that, for the policy families studied here, calibration choice does not remove cross-group heterogeneity; it determines whether the resulting distortion appears in the coverage or size dimension, providing a principled lens for analyzing fairness-oriented calibration choices in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies fairness issues in split conformal prediction arising from pooled calibration when score distributions differ across groups. It derives a conservation law and lower bound on irreducible group-wise coverage distortion from population score quantiles and their heterogeneity, shows that Equalized Coverage and Equalized Set Size are in fundamental tension, and quantifies the costs of moving between separate-group and pooled policies. Experiments on synthetic and real data are presented to confirm that the bidirectional trade-off persists after finite-sample calibration.

Significance. If the central results hold, the work supplies a useful theoretical lens for analyzing calibration choices in conformal prediction under group heterogeneity. The derivation of the conservation law directly from population score distributions (without reliance on fitted model parameters) is a clear strength, as is the explicit demonstration of tension between the two leading fairness notions and the quantification of policy costs. The experimental confirmation on real data adds practical relevance. These elements could help practitioners understand why fairness-oriented adjustments shift rather than eliminate distortion.

major comments (2)
  1. [theoretical derivation of the conservation law and lower bound] The conservation law and lower bound (derived from population score distributions and quantile heterogeneity) are asserted to characterize the distortion under finite-sample split conformal calibration. However, the manuscript provides no finite-sample error analysis or bound on the additional coverage distortion arising from the interaction between cross-group quantile heterogeneity and the sampling variability of empirical quantiles computed on the calibration set. This leaves open whether the population-level effect dominates the observed distortions or whether unmodeled finite-sample bias/variance is material.
  2. [analysis of tension between Equalized Coverage and Equalized Set Size] The claim that the two fairness definitions are fundamentally in tension is established at the population level. It is not shown whether this tension remains strict after replacing population quantiles with empirical ones, or whether finite-sample effects can partially alleviate or exacerbate the incompatibility between Equalized Coverage and Equalized Set Size.
minor comments (2)
  1. [methodology] Clarify the exact definition of the score function and how group membership is handled in the pooled versus separate calibration procedures to ensure reproducibility.
  2. [experiments] Add error bars or report the number of random trials and calibration-set sizes used in the synthetic and real-data experiments to allow assessment of variability.
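
One way the variability reporting requested in minor comment 2 could look, as a hedged sketch: repeat the calibration draw many times and report the mean and standard deviation of group-wise coverage under a pooled threshold. The Gaussian score distributions, trial count, and sample sizes are all hypothetical placeholders, not values from the paper.

```python
import numpy as np

ALPHA = 0.1
N_TRIALS = 200
N_CAL, N_TEST = 500, 2000  # per group, hypothetical sizes

# Hypothetical (mean, std) of the score distribution in each group.
GROUPS = {"A": (0.0, 1.0), "B": (0.8, 1.5)}


def pooled_threshold(scores, alpha):
    """Split conformal threshold computed on the pooled calibration scores."""
    n = len(scores)
    k = min(n, int(np.ceil((n + 1) * (1 - alpha))))
    return np.sort(scores)[k - 1]


def run_trial(rng):
    """One random calibration/test draw; returns test coverage for each group."""
    cal = np.concatenate([rng.normal(m, s, N_CAL) for m, s in GROUPS.values()])
    q = pooled_threshold(cal, ALPHA)
    return [np.mean(rng.normal(m, s, N_TEST) <= q) for m, s in GROUPS.values()]


rng = np.random.default_rng(0)
coverage = np.array([run_trial(rng) for _ in range(N_TRIALS)])  # shape (trials, groups)
mean = coverage.mean(axis=0)
std = coverage.std(axis=0, ddof=1)

for name, mu, sd in zip(GROUPS, mean, std):
    print(f"group {name}: coverage {mu:.3f} +/- {sd:.3f} over {N_TRIALS} trials")
```

Reporting the spread alongside the mean makes it possible to judge whether an observed group-wise deviation from 1 − α is larger than calibration noise.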

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The points raised regarding finite-sample considerations are well-taken and help sharpen the presentation of our population-level results. We address each major comment below and have revised the manuscript accordingly to clarify the scope of the theory and discuss empirical behavior.

Point-by-point responses
  1. Referee: [theoretical derivation of the conservation law and lower bound] The conservation law and lower bound (derived from population score distributions and quantile heterogeneity) are asserted to characterize the distortion under finite-sample split conformal calibration. However, the manuscript provides no finite-sample error analysis or bound on the additional coverage distortion arising from the interaction between cross-group quantile heterogeneity and the sampling variability of empirical quantiles computed on the calibration set. This leaves open whether the population-level effect dominates the observed distortions or whether unmodeled finite-sample bias/variance is material.

    Authors: We agree that the conservation law and lower bound are derived exactly at the population level from the true score distributions and their quantiles. The current manuscript does not include a rigorous finite-sample error analysis that bounds the additional distortion induced by the variability of empirical quantiles. This is a genuine limitation of the theoretical development. At the same time, the experiments (both synthetic, where distributions are known, and real-data) show that the observed group-wise coverage distortions track the population predictions closely for typical calibration sizes. In the revision we have added a paragraph in the discussion section that explicitly acknowledges the gap, invokes standard uniform convergence results for empirical quantiles to argue that the finite-sample distortion converges to the derived population bound as calibration-set size grows, and includes supplementary convergence plots in the appendix. revision: yes

  2. Referee: [analysis of tension between Equalized Coverage and Equalized Set Size] The claim that the two fairness definitions are fundamentally in tension is established at the population level. It is not shown whether this tension remains strict after replacing population quantiles with empirical ones, or whether finite-sample effects can partially alleviate or exacerbate the incompatibility between Equalized Coverage and Equalized Set Size.

    Authors: The proof that Equalized Coverage and Equalized Set Size are in fundamental tension is obtained by showing that, at the population level, satisfying one exactly forces the other to deviate unless cross-group quantile heterogeneity is zero. In finite samples the use of empirical quantiles introduces additional variability that could, in principle, create slack allowing partial satisfaction of both criteria simultaneously. Our experiments nevertheless demonstrate that the bidirectional trade-off persists under finite-sample calibration on both synthetic and real data, with the magnitude of the incompatibility scaling in line with the population prediction. In the revision we have added a short discussion paragraph clarifying that while finite-sample noise is present, the asymptotic incompatibility remains and is not alleviated by the empirical approximation; the experimental results are presented as supporting evidence that the tension is not an artifact of the population analysis. revision: yes
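
The "standard uniform convergence results" mentioned in the first response are not specified. One common choice is the Dvoretzky–Kiefer–Wolfowitz inequality with Massart's constant, P(sup_t |F_n(t) − F(t)| > ε) ≤ 2 exp(−2nε²); reference [7] is the original DKW paper, but whether the revision uses this exact bound is an assumption. The sketch below turns the inequality into a uniform band on a group's empirical score CDF, where n is that group's calibration count.

```python
import math


def dkw_epsilon(n, delta):
    """Half-width eps with sup_t |F_n(t) - F(t)| <= eps with probability >= 1 - delta."""
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def calibration_size_for(eps, delta):
    """Smallest n for which the DKW half-width is at most eps."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


for n in (100, 500, 2000, 10000):
    print(n, round(dkw_epsilon(n, delta=0.05), 4))

print("n needed for eps = 0.02 at delta = 0.05:", calibration_size_for(0.02, 0.05))
```

Read this way, a population-level coverage distortion is distinguishable from sampling noise only when it clearly exceeds the band half-width for the smallest group.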

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from population distributions

full rationale

The core claims rest on a direct derivation of a conservation law and lower bound from the population score distributions and their quantile heterogeneity, followed by an algebraic demonstration that Equalized Coverage and Equalized Set Size are in tension. Neither step reduces to a fitted parameter renamed as a prediction, a self-definitional loop, or a load-bearing self-citation whose content is itself unverified. The finite-sample extension is presented as an empirical confirmation rather than a mathematical reduction, and the paper does not invoke uniqueness theorems or ansatzes from prior author work to force its conclusions. The analysis therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard conformal prediction assumptions about score distributions and calibration; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Population score distributions underlie split conformal calibration and capture cross-group heterogeneity.
    Invoked when studying the phenomenon through population score distributions and deriving the conservation law.

pith-pipeline@v0.9.0 · 5704 in / 1186 out tokens · 37954 ms · 2026-05-19T17:03:38.013668+00:00 · methodology



Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1] Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations, 2021.

  2. [2] Francis Bach. A Convex Loss Function for Set Prediction with Optimal Trade-offs Between Size and Conditional Coverage. arXiv:2512.19142, 2025.

  3. [3] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017. doi: 10.1089/big.2016.0047

  4. [4] Jesse C. Cresswell, Yi Sui, Bhargava Kumar, and Noël Vouitsis. Conformal prediction sets improve human decision making. In Proceedings of the 41st International Conference on Machine Learning, 2024.

  5. [5] Jesse C. Cresswell, Bhargava Kumar, Yi Sui, and Mouloud Belbahri. Conformal prediction sets can cause disparate impact. In The Thirteenth International Conference on Learning Representations, 2025.

  6. [6] Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 120–128, 2019. ISBN ...

  7. [7] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.

  8. [8] Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455–482, 2021.

  9. [9] Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. De Finetti's theorem and related results for infinite weighted exchangeable sequences. Bernoulli, 30(4):3004–3028, 2024.

  10. [10] Chao Gao, Liren Shan, Vaidehi Srinivas, and Aravindan Vijayaraghavan. Volume optimality in conformal prediction with structured prediction sets. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pages 18495–18527, 2025.

  11. [11] Isaac Gibbs, John J Cherian, and Emmanuel J Candès. Conformal prediction with conditional guarantees. Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(4):1100–1126, 03 2025. ISSN 1369-7412. doi: 10.1093/jrsssb/qkaf008

  12. [12] Ozgur Guldogan, Neeraj Sarna, Yuanyuan Li, and Michael Berger. Counterfactually fair conformal prediction. In Proceedings of The 29th International Conference on Artificial Intelligence and Statistics, 2026.

  13. [13] Laura Gustafson, Chloe Rolland, Nikhila Ravi, Quentin Duval, Aaron Adcock, Cheng-Yang Fu, Melissa Hall, and Candace Ross. FACET: Fairness in computer vision evaluation benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20370–20382, 2023.

  14. [14] Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29, pages 3315–3323, 2016.

  15. [15] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference, volume 67, pages 43:1–43:23, 2017. doi: 10.4230/LIPIcs.ITCS.2017.43

  16. [16] Claire Lazar Reich and Suhas Vijaykumar. A Possibility in Algorithmic Fairness: Can Calibration and Equal Error Rates Be Reconciled? In 2nd Symposium on Foundations of Responsible Computing, volume 192, pages 4:1–4:21, 2021. doi: 10.4230/LIPIcs.FORC.2021.4

  17. [17] Meichen Liu, Lei Ding, Dengdeng Yu, Wulong Liu, Linglong Kong, and Bei Jiang. Conformalized fairness via quantile regression. Advances in Neural Information Processing Systems, 35:11561–11572, 2022.

  18. [18] Pengqi Liu, Zijun Yu, Mouloud Belbahri, Arthur Charpentier, Masoud Asgharian, and Jesse C. Cresswell. Beyond procedure: Substantive fairness in conformal prediction. In Proceedings of the 43rd International Conference on Machine Learning, 2026. To appear.

  19. [19] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 8748–8763, 2021.

  20. [20] Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel Candès. With malice toward none: Assessing uncertainty via equalized coverage. Harvard Data Science Review, 2(2):4, 2020.

  21. [21] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108, 2019.

  22. [22] Glenn Shafer and Vladimir Vovk. A Tutorial on Conformal Prediction. Journal of Machine Learning Research, 9(12):371–421, 2008.

  23. [23] Bernard W Silverman. Density estimation for statistics and data analysis. Routledge, 2018.

  24. [24] Davut Emre Tasar. The coverage-deferral trade-off: Fairness implications of conformal prediction in human-in-the-loop decision systems. Preprints, 2025. doi: 10.20944/preprints202512.2631.v1

  25. [25] Aditya T. Vadlamani, Anutam Srinivasan, Pranav Maneriker, Ali Payani, and Srinivasan Parthasarathy. A generic framework for conformal fairness. In The Thirteenth International Conference on Learning Representations, 2025.

  26. [26] Vladimir Vovk, David Lindsay, Ilia Nouretdinov, and Alex Gammerman. Mondrian confidence machine. Technical Report, 2003.

  27. [27] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer, 2005.

  28. [28] Fangxin Wang, Lu Cheng, Ruocheng Guo, Kay Liu, and Philip S Yu. Equal opportunity of coverage in fair regression. Advances in Neural Information Processing Systems, 36:7743–7755, 2023.

  29. [29] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, 2018. doi: 10.18653/v1/N18-1101

  30. [30] Yanfei Zhou and Matteo Sesia. Conformal classification with equalized coverage for adaptively selected groups. Advances in Neural Information Processing Systems, 37:108760–108823, 2024.

    Appendix Contents
    • Appendix A: Technical Discussion
    • Appendix B: Proofs of Theoretical Results
    • Appendix C: Additional Experimental Details
    • Appendix D: Bias in BiosE...

  31. [31] Under exact conservation, $\delta(q) = 0$, we have $\Omega_o(q)\,\Omega_u(q) \ge B_K^2$. When $q$ is the pooled population quantile and the mixture CDF $F_S$ is continuous at $q$, Theorem 1 gives $\delta(q) = 0$, so the exact-conservation form in part 3 is the relevant pooled-calibration case. Theorem 6 refines Theorem 1 from a signed additive conservation law to a magnitude lower bound. The prod...

  32. [32] … shows that in binary settings, when two groups have different base rates, $\pi_g = P(Y = 1 \mid G = g)$, predictive parity, i.e., equal $\mathrm{PPV}_g = P(Y = 1 \mid \hat{Y} = 1, G = g)$ across groups, cannot generally hold simultaneously with equalized error profiles matching the false positive rates $\mathrm{FPR}_g = P(\hat{Y} = 1 \mid Y = 0, G = g)$ and the true positive rates $\mathrm{TPR}_g = P(\hat{Y} = 1 \mid Y = 1, G = g)$. The incompatibility follows from a Bayes coupling identity relating predictive values, base rates, and the likelihood ratio $\mathrm{TPR}_g/\mathrm{FPR}_g$ ...

  33. [33] The group-wise thresholds $\{q_g\}_{g \in \mathcal{G}}$, which achieve an exact group-wise coverage level $1 - \alpha$, necessarily induce a nonzero cross-group disparity in expected set size. More precisely, for every $g \in \mathcal{H}_r \setminus \{r\}$, $\ell_g(q_g) - \ell_r(q_r) \ge c_g > 0$. Consequently,

    $$\max_{g, g' \in \mathcal{G}} \left| \ell_g(q_g) - \ell_{g'}(q_{g'}) \right| \ge \max_{g \in \mathcal{H}_r \setminus \{r\}} c_g > 0. \tag{14}$$

    Therefore, exact group-wise coverage cannot simultaneously sa...

  34. [34] The restricted mean squared cross-group size disparity relative to the reference group $r$ satisfies

    $$D_r^2 = \sum_{g \in \mathcal{H}_r \setminus \{r\}} p_g \left( \ell_g(q_g) - \ell_r(q_r) \right)^2 \ge \sum_{g \in \mathcal{H}_r \setminus \{r\}} p_g c_g^2 > 0. \tag{15}$$

    Proof. The group-wise thresholds $\{q_g\}_{g \in \mathcal{G}}$ achieve equalized coverage at level $1 - \alpha$ across groups. Now fix any $g \in \mathcal{H}_r \setminus \{r\}$. By definition of $\mathcal{H}_r$, we have $q_g \ge q_r$. Since $t \mapsto \ell_g(t)$ is non-decreasing...

  35. [35]

    Custom license, see dataset download agreement,

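    Under exact conservation, $\delta(q) = 0$, we have $\Omega_o(q)\,\Omega_u(q) \ge B_K^2$. Proof. For any group with $q \ge q_g$, we have $\varepsilon_g(q) \ge m_g (q - q_g)$. Similarly, for any group with $q \le q_g$, we have $\varepsilon_g(q) \le m_g (q - q_g)$. Therefore, we have

    $$\Omega_o(q) \ge \sum_g w_g (q - q_g)_+ =: A_+(q), \qquad \Omega_u(q) \ge \sum_g w_g (q_g - q)_+ =: A_-(q). \tag{37}$$

    At the crossing point $\bar{q}_m$,

    $$A_+(\bar{q}_m) = A_-(\bar{q}_m) = \frac{1}{2} \sum_g w_g \left| q_g - \bar{q}_m \right| = B_K. \tag{38}$$

    Since $A_+$(q...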
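
A small worked sketch of the quantities in (37) and (38), assuming only the definitions visible in the excerpt. Because (x)₊ − (−x)₊ = x, the two sums in (37) are equal exactly where Σ w_g (q − q_g) = 0, so the crossing point is the w-weighted mean of the group quantiles. The weights and quantiles below are hypothetical numbers; how the paper defines w_g and m_g is not shown in the excerpt.

```python
def crossing_point(weights, quantiles):
    """Point where A_plus equals A_minus: the weighted mean of the group quantiles."""
    return sum(w * q for w, q in zip(weights, quantiles)) / sum(weights)


def a_plus(q, weights, quantiles):
    """A_+(q) = sum_g w_g (q - q_g)_+ as in (37)."""
    return sum(w * max(q - qg, 0.0) for w, qg in zip(weights, quantiles))


def a_minus(q, weights, quantiles):
    """A_-(q) = sum_g w_g (q_g - q)_+ as in (37)."""
    return sum(w * max(qg - q, 0.0) for w, qg in zip(weights, quantiles))


def heterogeneity_scale(weights, quantiles):
    """B_K = (1/2) sum_g w_g |q_g - q_bar| evaluated at the crossing point, as in (38)."""
    q_bar = crossing_point(weights, quantiles)
    return 0.5 * sum(w * abs(qg - q_bar) for w, qg in zip(weights, quantiles))


# Hypothetical weights w_g and group quantiles q_g, for illustration only.
weights = [0.5, 0.3, 0.2]
quantiles = [1.0, 1.4, 2.1]

q_bar = crossing_point(weights, quantiles)
b_k = heterogeneity_scale(weights, quantiles)
print("crossing point:", q_bar)
print("A_+:", a_plus(q_bar, weights, quantiles))
print("A_-:", a_minus(q_bar, weights, quantiles))
print("B_K:", b_k)
```

With identical group quantiles the scale is zero, which matches the reading that the lower bound is driven entirely by cross-group quantile heterogeneity.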