arxiv: 2605.14260 · v1 · submitted 2026-05-14 · 📊 stat.ML · cs.LG

Recognition: 2 theorem links

· Lean Theorem

On the Burden of Achieving Fairness in Conformal Prediction

Ziang Gao , Pengqi Liu , Archer Yi Yang , Mouloud Belbahri , Jesse C. Cresswell , Masoud Asgharian

Authors on Pith no claims yet

Pith reviewed 2026-05-15 02:35 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords conformal predictionfairnesscoverage distortionpooled calibrationequalized coveragequantile heterogeneityprediction setssplit conformal

0 comments

The pith

Pooled calibration in conformal prediction creates irreducible coverage distortion across groups set by quantile heterogeneity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that calibrating conformal predictors with one shared threshold hides differences in how scores are distributed across groups. It derives a conservation law proving that this shared threshold must produce coverage rates that deviate from the target for some groups, with the size of the deviation fixed by how much the groups' score quantiles differ. It further establishes that the two common fairness goals of equal coverage across groups and equal prediction-set sizes are incompatible under a single policy. These results matter because conformal methods are meant to deliver reliable uncertainty estimates, yet pooled calibration forces practitioners to accept distortion in either reliability or set size. Experiments confirm the same trade-off persists in finite samples on both synthetic and real data.

Core claim

Pooled calibration incurs irreducible group-wise coverage distortion at a scale set by cross-group quantile heterogeneity. The two leading fairness definitions, equalized coverage and equalized set size, are in fundamental tension. The choice between treating groups separately or pooling them determines whether the resulting distortion appears in the coverage or the size dimension.

What carries the argument

A conservation law relating pooled and group-wise coverage probabilities derived from the population score distributions in split conformal prediction.

Load-bearing premise

The derivations rely on population-level score distributions for each group being well-defined and independent of the training process.

What would settle it

Measure group-wise coverage and average set sizes on a dataset with known score distributions under both pooled and separate calibration; if the observed coverage gaps exactly match the quantile-heterogeneity lower bound (within finite-sample error), the bound holds, while systematic deviation would falsify it.

Figures

Figures reproduced from arXiv: 2605.14260 by Archer Yi Yang, Jesse C. Cresswell, Masoud Asgharian, Mouloud Belbahri, Pengqi Liu, Ziang Gao.

**Figure 2.** Figure 2: Two-group Gaussian pooled-threshold [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Bias in Bios mechanism view at α = 0.1 for the simple score. Panel A illustrates the pooled-threshold mechanism in Theorem 1; Panels B–C illustrate Theorems 3–4 and Corollaries 1–2; Panel D summarizes the three distortions (Theorem 2, Corollaries 1–2) for male and female groups [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: MultiNLI at α = 0.1 with simple (left) and RAPS (right) scores. For each score, Panel A shows signed coverage distortion Fˆ S|g(ˆq) − (1 − α) under pooled threshold (Theorem 1): positive bars indicate over-coverage and negative bars indicate under-coverage. Panel B shows the signed change in expected set size ˆℓg(ˆqg) − ˆℓg(ˆq) after switching to group-wise thresholds that equalize coverage (Corollary 1). … view at source ↗

**Figure 5.** Figure 5: FACET at α = 0.1 with the RAPS score; Panel A illustrates Theorem 1, and Panels B–C illustrate Corollaries 1–2. We next show the same mechanisms on FACET [13] using the RAPS score on the age group split (Younger, Middle, Older, Unknown) with a zero-shot CLIP ViT-L/14 classifier [19] [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Four multi-group families. Across all four score families, the empirical RMS miscoverage [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: Imbalanced four-group pooled-threshold diagnostics. Across all four families, the weighted [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Bias in Bios mechanism view at α = 0.10 for SAPS score. Panel A illustrates the pooled-threshold mechanism in Theorem 1; Panels B–C illustrate Theorems 3– 4 and Corollaries 1–2; Panel D summarizes the three distortions for male and female groups. 0.0 0.5 1.0 1.5 2.0 True-label RAPS nonconformity score 0.0 0.2 0.4 0.6 0.8 1.0 Empirical CDF A. Calibration ECDFs and thresholds Male Female Pooled threshold Tar… view at source ↗

**Figure 9.** Figure 9: Bias in Bios mechanism view at α = 0.10 for RAPS score. Panel A illustrates the pooled-threshold mechanism in Theorem 1; Panels B–C illustrate Theorems 3– 4 and Corollaries 1–2; Panel D summarizes the three distortions for male and female groups. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗

**Figure 10.** Figure 10: Finite-calibration detectability diagnostics of the pooled-threshold floor from Theorem [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗

**Figure 11.** Figure 11: shows that across α ∈ {0.05, 0.07, 0.085, 0.10}, the empirical pooled-threshold distortion stays above the estimated lower-bound scale, while the induced size and coverage distortions remain nonzero throughout. E.3 Controlled Genre Temperature Sweep At α = 0.1, we perturb only the facetoface genre via temperature scaling [PITH_FULL_IMAGE:figures/full_fig_p031_11.png] view at source ↗

**Figure 12.** Figure 12: Controlled MultiNLI temperature sweep at α = 0.10 for the simple score, perturbing the facetoface genre only. Panel A corresponds to Theorem 2. Panels B–C show Corollaries 1–2. The same trade-off mechanism remains visible across the temperature sweep. -0.024 0.000 0.024 Coverage distortion Government Oup Verbatim Travel Letters Facetoface Fiction Telephone Slate Nineeleven A Pooled threshold -0.141 0.000 … view at source ↗

**Figure 13.** Figure 13: MultiNLI at α = 0.10 using SAPS score. For this score, Panel A shows pooled quantile consequence of Theorem 1; Panels B and C illustrate the set size distortion in Corollary 1, and the coverage distortion in Corollary 2, respectively. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_13.png] view at source ↗

**Figure 14.** Figure 14: MultiNLI robustness across α for SAPS (top) and RAPS (bottom) scores. In each row, Panel A is best read as a finite-sample diagnostic for Theorem 2 based on an estimated lower-bound proxy, rather than a pointwise lower-bound verification. Panels B–C show Corollaries 1–2. The induced set-size and coverage distortions remain visible across the tested α-grid. 34 [PITH_FULL_IMAGE:figures/full_fig_p034_14.png] view at source ↗

**Figure 15.** Figure 15: Controlled MultiNLI temperature sweep at α = 0.10 for SAPS (top) and RAPS (bottom) scores, perturbing the facetoface genre only. Panel A is best read as a finite-sample diagnostic for Theorem 2. Panels B–C illustrate Corollaries 1–2. The same trade-off mechanism remains visible across the temperature sweep. 50 100 150 200 250 300 350 400 450500 Calibration size 1.0 1.5 2.0 2.5 SNR = true floor / sd(empiri… view at source ↗

**Figure 16.** Figure 16: Finite-calibration detectability of the pooled-threshold floor from Theorem [PITH_FULL_IMAGE:figures/full_fig_p035_16.png] view at source ↗

**Figure 17.** Figure 17: shows that for various target levels α ∈ {0.05, 0.07, 0.085, 0.10}, the empirical pooled-threshold distortion stays above the empirical lower bound. The induced set size distortion remains nonzero throughout and the equalized expected set size policy continues to produce a nonzero RMS coverage distortion. Next, we perturb only the Younger group through temperature scaling while keeping the rest of the gro… view at source ↗

**Figure 18.** Figure 18: Controlled FACET temperature sweep at α = 0.10 for the RAPS score, perturbing the Younger group only. Panel A evaluates the lower-bound behavior in Theorem 2. Panels B–C show Corollaries 1–2. The same trade-off mechanism remains visible across the temperature sweep. 36 [PITH_FULL_IMAGE:figures/full_fig_p036_18.png] view at source ↗

read the original abstract

Conformal prediction is often calibrated with a single pooled threshold, but this can hide cross-group heterogeneity in score distributions and distort group-wise coverage. We study this phenomenon through the population score distributions underlying split conformal calibration. First, we derive a conservation law and lower bound showing that pooled calibration incurs irreducible group-wise coverage distortion at a scale set by cross-group quantile heterogeneity. Second, we demonstrate that the two leading fairness definitions for conformal prediction, Equalized Coverage and Equalized Set Size, are fundamentally in tension. Third, we quantify the cost of moving between policies which treat groups separately or pool them. Experiments on synthetic and real data confirm the same bidirectional trade-off after finite-sample calibration. Our results show that, for the policy families studied here, calibration choice does not remove cross-group heterogeneity; it determines whether the resulting distortion appears in the coverage or size dimension, providing a principled lens for analyzing fairness-oriented calibration choices in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Pooled calibration in conformal prediction creates a population-level coverage distortion tied to cross-group quantile differences, with a real tension between equalized coverage and equalized set size, but the experiments do not cleanly separate that irreducible term from finite-sample error.

read the letter

The paper's core contribution is a conservation law and lower bound showing that pooled split conformal calibration forces group-wise coverage deviations whose size tracks heterogeneity in the underlying score distributions. It also shows that Equalized Coverage and Equalized Set Size pull against each other, so any policy choice simply moves the distortion from one axis to the other. The derivations start from population quantities and look independent of fitted parameters, which is a clean way to frame the problem. The synthetic and real-data runs reproduce the bidirectional trade-off after finite-sample calibration, which is useful for seeing the effect in practice. That part is worth having on record for anyone thinking about fairness constraints in regulated settings. The main limitation is that the experiments do not isolate the claimed population term from ordinary quantile estimation error. Split conformal replaces the population quantiles with empirical ones whose accuracy depends on calibration-set size and score variance per group, so the observed distortions contain both components. Without a separate bound or ablation that holds the population heterogeneity fixed while varying sample size, it remains unclear how much of the reported effect is the fundamental limit versus standard finite-sample noise. The math itself appears solid and the citation pattern is appropriate, but tightening the experiments to separate the two sources would strengthen the central claim. This is worth sending to referees for people working on conformal fairness or deploying these methods where group parity matters. The derivations give a principled lens even if the finite-sample story needs more work.

Referee Report

2 major / 2 minor

Summary. The paper claims that pooled calibration in split conformal prediction incurs an irreducible group-wise coverage distortion whose scale is set by cross-group quantile heterogeneity, as formalized by a derived population-level conservation law and lower bound. It further shows that Equalized Coverage and Equalized Set Size are in fundamental tension, quantifies the cost of moving between separate-group and pooled policies, and validates the bidirectional coverage/size trade-off on synthetic and real data after finite-sample calibration.

Significance. If the central derivations hold, the work supplies a principled population-level explanation for why fairness-oriented calibration choices in conformal prediction merely relocate rather than eliminate cross-group heterogeneity. This is a useful lens for practitioners and offers falsifiable predictions about the location of distortion under different policies.

major comments (2)

[§3] §3 (conservation law and lower bound): The derivations start from population score distributions and produce an irreducible term controlled by quantile heterogeneity. However, the finite-sample experiments in §4 replace population quantiles with empirical ones computed on finite calibration sets per group; the manuscript does not isolate or bound the additive estimation error component separately from the claimed population term, leaving open whether observed distortions are dominated by the irreducible heterogeneity or by finite-sample artifacts.
[§4] §4 (experiments): The synthetic and real-data results demonstrate the bidirectional trade-off, but without controls that vary calibration-set size while holding population heterogeneity fixed, or that report separate estimates of the population component, it is difficult to confirm that the population conservation law dominates the observed finite-sample distortions as asserted in the abstract.

minor comments (2)

[Introduction] The transition from the population derivations to the finite-sample setting could be stated more explicitly in the introduction or §2 to clarify how the lower bound is expected to manifest after empirical quantile estimation.
[§3] Notation for group-wise quantiles and coverage deviations is introduced in §3 but could be summarized in a single table for quick reference when reading the experimental results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the distinction between the population-level derivations and the finite-sample experiments. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [§3] §3 (conservation law and lower bound): The derivations start from population score distributions and produce an irreducible term controlled by quantile heterogeneity. However, the finite-sample experiments in §4 replace population quantiles with empirical ones computed on finite calibration sets per group; the manuscript does not isolate or bound the additive estimation error component separately from the claimed population term, leaving open whether observed distortions are dominated by the irreducible heterogeneity or by finite-sample artifacts.

Authors: The derivations in §3 are explicitly population-level and yield an exact conservation law together with a lower bound on coverage distortion that depends only on cross-group quantile heterogeneity. The §4 experiments are intended to show that the same qualitative bidirectional trade-off appears once the population quantiles are replaced by their finite-sample conformal estimates. We acknowledge that the manuscript does not supply a separate analytic bound on the quantile estimation error. In the revision we will add a short paragraph in §4 (and a corresponding remark in the appendix) that (i) recalls the known consistency of conformal quantiles, (ii) notes that the observed distortions remain aligned in sign and approximate magnitude with the population lower bound even for moderate calibration sizes, and (iii) states that a full finite-sample decomposition is left for future work. This clarifies the relationship without altering the central claims. revision: partial
Referee: [§4] §4 (experiments): The synthetic and real-data results demonstrate the bidirectional trade-off, but without controls that vary calibration-set size while holding population heterogeneity fixed, or that report separate estimates of the population component, it is difficult to confirm that the population conservation law dominates the observed finite-sample distortions as asserted in the abstract.

Authors: The current experiments fix calibration-set sizes that are typical in practice and already vary effective sample sizes across groups in the synthetic design. While we agree that an explicit sweep over calibration size (holding the underlying score distributions fixed) would strengthen the isolation of the population term, the existing results are consistent with the population predictions across both synthetic and real data. In the revision we will add a supplementary figure that varies calibration-set size on the synthetic data and overlays the empirical distortion against the population lower bound; this will make the convergence behavior explicit and address the concern directly. revision: yes

Circularity Check

0 steps flagged

Derivation from population score distributions is self-contained

full rationale

The paper's central derivation begins from assumed population-level score distributions per group and derives a conservation law plus lower bound on coverage distortion driven by cross-group quantile heterogeneity. This step is a direct mathematical consequence of the definitions of pooled vs. group-wise quantiles and does not reduce to fitted parameters, self-referential quantities, or prior self-citations. Subsequent claims about tension between Equalized Coverage and Equalized Set Size follow from the same population quantities. Finite-sample experiments are presented as separate empirical confirmation rather than part of the derivation. No load-bearing step matches any of the enumerated circularity patterns; the population analysis stands independently of the finite-sample calibration details.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard conformal prediction assumptions about score distributions and calibration; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Population score distributions exist and are distinct across groups
Invoked to derive the conservation law and lower bound on coverage distortion.

pith-pipeline@v0.9.0 · 5473 in / 1190 out tokens · 27103 ms · 2026-05-15T02:35:56.558120+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 2 (Pooled-threshold uncertainty relation): Var(εG(q)) ≥ meff(q)² Var(qG) under local density lower bounds.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

[1]

Uncer- tainty sets for image classifiers using conformal prediction

Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncer- tainty sets for image classifiers using conformal prediction. InInternational Conference on Learning Representations, 2021

work page 2021
[2]

A Convex Loss Function for Set Prediction with Optimal Trade-offs Between Size and Conditional Coverage.arXiv:2512.19142, 2025

Francis Bach. A Convex Loss Function for Set Prediction with Optimal Trade-offs Between Size and Conditional Coverage.arXiv:2512.19142, 2025

work page arXiv 2025
[3]

Fair prediction with disparate impact: A study of bias in recidivism prediction instruments.Big data, 5(2):153–163, 2017

Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments.Big data, 5(2):153–163, 2017. doi: 10.1089/big.2016.0047

work page doi:10.1089/big.2016.0047 2017
[4]

Cresswell, Yi Sui, Bhargava Kumar, and Noël V ouitsis

Jesse C. Cresswell, Yi Sui, Bhargava Kumar, and Noël V ouitsis. Conformal prediction sets improve human decision making. InProceedings of the 41st International Conference on Machine Learning, 2024

work page 2024
[5]

Cresswell, Bhargava Kumar, Yi Sui, and Mouloud Belbahri

Jesse C. Cresswell, Bhargava Kumar, Yi Sui, and Mouloud Belbahri. Conformal prediction sets can cause disparate impact. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[6]

Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting

Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexan- dra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting. InProceedings of the Conference on Fairness, Accountability, and Transparency, page 120–128, 2019. ISBN ...

work page doi:10.1145/3287560.3287572 2019
[7]

Asymptotic minimax character of the sam- ple distribution function and of the classical multinomial estimator.The Annals of Mathematical Statistics, pages 642–669, 1956

Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sam- ple distribution function and of the classical multinomial estimator.The Annals of Mathematical Statistics, pages 642–669, 1956

work page 1956
[8]

The limits of distribution-free conditional predictive inference.Information and Inference: A Journal of the IMA, 10(2):455–482, 2021

Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. The limits of distribution-free conditional predictive inference.Information and Inference: A Journal of the IMA, 10(2):455–482, 2021

work page 2021
[9]

De finetti’s theorem and related results for infinite weighted exchangeable sequences.Bernoulli, 30(4): 3004–3028, 2024

Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. De finetti’s theorem and related results for infinite weighted exchangeable sequences.Bernoulli, 30(4): 3004–3028, 2024

work page 2024
[10]

V olume optimality in conformal prediction with structured prediction sets

Chao Gao, Liren Shan, Vaidehi Srinivas, and Aravindan Vijayaraghavan. V olume optimality in conformal prediction with structured prediction sets. InProceedings of the 42nd International Conference on Machine Learning, volume 267, pages 18495–18527, 2025

work page 2025
[11]

Conformal prediction with conditional guarantees.Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(4): 1100–1126, 03 2025

Isaac Gibbs, John J Cherian, and Emmanuel J Candès. Conformal prediction with conditional guarantees.Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(4): 1100–1126, 03 2025. ISSN 1369-7412. doi: 10.1093/jrsssb/qkaf008

work page doi:10.1093/jrsssb/qkaf008 2025
[12]

Counterfactually fair conformal prediction

Ozgur Guldogan, Neeraj Sarna, Yuanyuan Li, and Michael Berger. Counterfactually fair conformal prediction. InProceedings of The 29th International Conference on Artificial Intelligence and Statistics, 2026

work page 2026
[13]

FACET: Fairness in computer vision evaluation benchmark

Laura Gustafson, Chloe Rolland, Nikhila Ravi, Quentin Duval, Aaron Adcock, Cheng-Yang Fu, Melissa Hall, and Candace Ross. FACET: Fairness in computer vision evaluation benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20370– 20382, 2023

work page 2023
[14]

Equality of opportunity in supervised learning

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29, pages 3315–3323, 2016

work page 2016
[15]

Inherent trade-offs in the fair determination of risk scores

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In8th Innovations in Theoretical Computer Science Conference, volume 67, pages 43:1–43:23, 2017. doi: 10.4230/LIPIcs.ITCS.2017.43

work page doi:10.4230/lipics.itcs.2017.43 2017
[16]

Claire Lazar Reich and Suhas Vijaykumar. A Possibility in Algorithmic Fairness: Can Calibra- tion and Equal Error Rates Be Reconciled? In2nd Symposium on Foundations of Responsible Computing, volume 192, pages 4:1–4:21, 2021. doi: 10.4230/LIPIcs.FORC.2021.4. 11

work page doi:10.4230/lipics.forc.2021.4 2021
[17]

Conformal- ized fairness via quantile regression.Advances in Neural Information Processing Systems, 35: 11561–11572, 2022

Meichen Liu, Lei Ding, Dengdeng Yu, Wulong Liu, Linglong Kong, and Bei Jiang. Conformal- ized fairness via quantile regression.Advances in Neural Information Processing Systems, 35: 11561–11572, 2022

work page 2022
[18]

Cresswell

Pengqi Liu, Zijun Yu, Mouloud Belbahri, Arthur Charpentier, Masoud Asgharian, and Jesse C. Cresswell. Beyond procedure: Substantive fairness in conformal prediction. InProceedings of the 43rd International Conference on Machine Learning, 2026. To appear

work page 2026
[19]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 8748–8763, 2021

work page 2021
[20]

With malice toward none: Assessing uncertainty via equalized coverage.Harvard Data Science Review, 2(2):4, 2020

Yaniv Romano, Rina Foygel Barber, Chiara Sabatti, and Emmanuel Candès. With malice toward none: Assessing uncertainty via equalized coverage.Harvard Data Science Review, 2(2):4, 2020

work page 2020
[21]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.arXiv:1910.01108, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[22]

A Tutorial on Conformal Prediction.Journal of Machine Learning Research, 9(12):371–421, 2008

Glenn Shafer and Vladimir V ovk. A Tutorial on Conformal Prediction.Journal of Machine Learning Research, 9(12):371–421, 2008

work page 2008
[23]

Routledge, 2018

Bernard W Silverman.Density estimation for statistics and data analysis. Routledge, 2018

work page 2018
[24]

The coverage-deferral trade-off: Fairness implications of conformal predic- tion in human-in-the-loop decision systems.Preprints, 2025

Davut Emre Tasar. The coverage-deferral trade-off: Fairness implications of conformal predic- tion in human-in-the-loop decision systems.Preprints, 2025. doi: 10.20944/preprints202512. 2631.v1

work page doi:10.20944/preprints202512 2025
[25]

Vadlamani, Anutam Srinivasan, Pranav Maneriker, Ali Payani, and Srinivasan Parthasarathy

Aditya T. Vadlamani, Anutam Srinivasan, Pranav Maneriker, Ali Payani, and Srinivasan Parthasarathy. A generic framework for conformal fairness. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[26]

Mondrian confidence machine.Technical Report, 2003

Vladimir V ovk, David Lindsay, Ilia Nouretdinov, and Alex Gammerman. Mondrian confidence machine.Technical Report, 2003

work page 2003
[27]

Springer, 2005

Vladimir V ovk, Alexander Gammerman, and Glenn Shafer.Algorithmic learning in a random world. Springer, 2005

work page 2005
[28]

Equal opportunity of coverage in fair regression.Advances in Neural Information Processing Systems, 36:7743–7755, 2023

Fangxin Wang, Lu Cheng, Ruocheng Guo, Kay Liu, and Philip S Yu. Equal opportunity of coverage in fair regression.Advances in Neural Information Processing Systems, 36:7743–7755, 2023

work page 2023
[29]

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, 2018. doi: 10.18653/v1/N18-1101

work page doi:10.18653/v1/n18-1101 2018
[30]

Conformal classification with equalized coverage for adaptively selected groups.Advances in Neural Information Processing Systems, 37:108760–108823, 2024

Yanfei Zhou and Matteo Sesia. Conformal classification with equalized coverage for adaptively selected groups.Advances in Neural Information Processing Systems, 37:108760–108823, 2024. 12 Appendix Contents • Appendix A: Technical Discussion • Appendix B: Proofs of Theoretical Results • Appendix C: Additional Experimental Details • Appendix D:Bias in BiosE...

work page 2024
[31]

Under exact conservation,δ(q) = 0, we haveΩ o(q) Ωu(q)≥B 2 K. When q is the pooled population quantile and the mixture CDF FS is continuous at q, Theorem 1 givesδ(q) = 0, so the exact-conservation form in part 3 is the relevant pooled-calibration case. Theorem 6 refines Theorem 1 from a signed additive conservation law to a magnitude lower bound. The prod...

work page
[32]

The incompatibility follows from a Bayes coupling identity relating predictive values, base rates, and the likelihood ratio TPRg/FPRg

shows that in binary settings, when two groups have different base rates,πg =P(Y= 1|G=g) , predictive parity, i.e., equalPPVg =P(Y= 1| ˆY= 1, G=g) across groups, cannot generally hold simultaneously with equalized error profiles matching the false positive rates FPRg =P( ˆY= 1| Y= 0, G=g) and the true positive rates TPRg =P( ˆY= 1|Y= 1, G=g) . The incompa...

work page
[33]

More precisely, for every g∈ H r \ {r},ℓ g(qg)−ℓ r(qr)≥c g >0

The group-wise thresholds {qg}g∈G, which achieve an exact group-wise coverage level 1−α , necessarily induce a nonzero cross-group disparity in expected set size. More precisely, for every g∈ H r \ {r},ℓ g(qg)−ℓ r(qr)≥c g >0. Consequently, max g,g ′∈G |ℓg(qg)−ℓ g′(qg′)| ≥max g∈Hr\{r} cg >0.(14) Therefore, exact group-wise coverage cannot simultaneously sa...

work page
[34]

(15) 17 Proof

The restricted mean squared cross-group size disparity relative to the reference grouprsatisfies D2 r = X g∈Hr\{r} pg(ℓg(qg)−ℓ r(qr))2 ≥ X g∈Hr\{r} pgc2 g >0. (15) 17 Proof. The group-wise thresholds {qg}g∈G achieve equalized coverage at level 1−α across groups. Now fix any g∈ H r \ {r}. By definition of Hr, we have qg ≥q r. Since t7→ℓ g(t) is non-decreas...

work page
[35]

Custom license, see dataset download agreement,

Under exact conservation,δ(q) = 0, we haveΩ o(q) Ωu(q)≥B 2 K. Proof. For any group with q≥q g, we have εg(q)≥m g(q−q g). Similarly, for any group with q≤q g, we haveε g(q)≤m g(q−q g). Therefore, we have Ωo(q)≥ X g wg(q−q g)+ =:A +(q) Ω u(q)≥ X g wg(qg −q) + =:A −(q).(37) At the crossing point¯qm, A+(¯qm) =A −(¯qm) = 1 2 X g wg|qg −¯qm|=B K.(38) Since A+(q...

work page 2000