Mitigating Label Bias with Interpretable Rubric Embeddings

Calvin Isley; Johann D. Gaebler; Sharad Goel

arxiv: 2605.21455 · v1 · pith:KWNJQAOTnew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-21 05:29 UTC · model grok-4.3

classification 💻 cs.LG

keywords: rubric embeddings · label bias · fair machine learning · interpretability · admissions decisions · group disparities · cohort quality · biased historical labels

The pith

Rubric embeddings derived from expert criteria reduce label bias while improving cohort quality in admissions models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Models for high-stakes decisions like hiring or university admissions are often trained on past human judgments that may unfairly favor some groups. This paper replaces standard learned embeddings with rubric embeddings built from explicit expert-defined criteria that track the actual outcome of interest. The change anchors predictions to interpretable dimensions and limits the influence of biased proxy signals in the historical labels. Both theory and tests on a real master's program application dataset indicate lower group disparities alongside higher measures of cohort quality. A reader would care because the method gives a concrete way to train useful models even when the available labels contain systematic unfairness.

Core claim

Basing predictions on rubric embeddings mitigates label bias under plausible conditions. The framework replaces standard black-box embeddings with features derived from expert-defined criteria aligned with the underlying construct of interest, thereby guarding against biased proxy signals. On a novel dataset of applications to a large master's program, models trained on these embeddings reduce group disparities while improving measures of cohort quality.

What carries the argument

Rubric embeddings: features constructed directly from expert-defined evaluation criteria that align with the true construct of interest rather than from opaque learned representations.
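As a minimal sketch of the idea, assuming each application has already been scored against a fixed list of expert criteria, a rubric embedding is simply those scores placed in a fixed order. The criterion names, the 1 to 5 scale, and the `rubric_embedding` helper below are hypothetical illustrations, not the paper's rubric or scoring pipeline.

```python
import numpy as np

# Hypothetical criteria; the paper's actual rubric is not listed on this page.
CRITERIA = ["quantitative_preparation", "research_experience", "writing_clarity"]

def rubric_embedding(scores: dict, lo: float = 1.0, hi: float = 5.0) -> np.ndarray:
    """Map per-criterion scores to a fixed-order vector rescaled to [0, 1]."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise KeyError(f"missing rubric scores: {missing}")
    raw = np.array([scores[c] for c in CRITERIA], dtype=float)
    return (raw - lo) / (hi - lo)

# Invented example applicant, scored on the assumed 1-5 scale.
applicant = {"quantitative_preparation": 4, "research_experience": 2, "writing_clarity": 5}
x = rubric_embedding(applicant)
print(dict(zip(CRITERIA, x)))
```

Each coordinate of the resulting vector has a named meaning, which is what makes the representation inspectable in a way a learned text embedding is not.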

If this is right

Models inherit fewer unjust biases from historical human evaluations.
Group disparities decline in outcomes such as university admissions decisions.
Cohort quality indicators rise because predictions stay tied to relevant dimensions.
The approach supplies a practical route to useful learning when training labels are known to be biased.
Theoretical bias reduction holds whenever the rubric criteria match the construct of interest.
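
One way to see why the last point can hold is a short worked derivation under an additive bias model. The model is an assumption made here for illustration; it echoes the "male advantage b" of the figure captions but does not reproduce the paper's Eq. (1) or its formal results. Here $Y$ is the true score, $A \in \{0, 1\}$ is group membership, $R$ is the rubric embedding, and $\varepsilon$ is noise.

```latex
\begin{align*}
Y' &= Y + bA + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid R] = 0, \\
\mathbb{E}[Y' \mid R] &= \mathbb{E}[Y \mid R] + b\,\mathbb{E}[A \mid R] \\
 &= \mathbb{E}[Y \mid R] + b\,\Pr(A = 1)
 \qquad \text{if $A$ is independent of $R$.}
\end{align*}
```

Under these assumptions, regressing the proxy label on the rubric differs from regressing the true score only by a constant, so the ranking of applicants is unchanged. If $R$ is correlated with $A$, the term $b\,\mathbb{E}[A \mid R]$ varies across applicants and the bias can re-enter.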

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same rubric approach could be tested in content moderation to limit carry-over from past biased moderation decisions.
Defining high-quality expert criteria becomes a central engineering task for achieving fairness in other domains.
Synthetic data experiments with controlled label bias could quantify exactly how much disparity reduction occurs at different bias levels.
The method points toward broader use of domain-grounded, human-specified features to make high-stakes models less dependent on flawed historical proxies.
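
The synthetic experiment suggested in this list can be sketched in a few lines. Everything in the data-generating process below is an assumption chosen for illustration (additive advantage `b`, group membership independent of true quality, a single leaking feature); it is not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 20_000, 0.5                      # b: assumed additive advantage for group A = 1

y = rng.normal(size=n)                  # true (unobserved) quality Y
a = rng.integers(0, 2, size=n)          # group indicator A, independent of Y here
y_proxy = y + b * a + rng.normal(scale=0.5, size=n)   # biased historical label Y'

# Rubric features: noisy measurements of Y that carry no group signal.
rubric = y[:, None] + rng.normal(scale=0.7, size=(n, 3))
# "Black-box" features: the same signal plus a column that leaks group membership.
leak = a + rng.normal(scale=0.1, size=n)
blackbox = np.column_stack([rubric, leak])

def fit_predict(features: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Ordinary least squares with an intercept; returns in-sample predictions."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef

def group_gap(pred: np.ndarray) -> float:
    """Mean prediction error (pred - Y) for A = 1 minus that for A = 0."""
    err = pred - y
    return float(err[a == 1].mean() - err[a == 0].mean())

gap_rubric = group_gap(fit_predict(rubric, y_proxy))
gap_blackbox = group_gap(fit_predict(blackbox, y_proxy))
print(f"rubric gap: {gap_rubric:+.3f}   black-box gap: {gap_blackbox:+.3f}")
```

Sweeping `b` over a grid and repeating across seeds would give a disparity curve at each bias level. The gap for the rubric model stays near zero here only because the simulated rubric features carry no group signal, which is exactly the premise flagged as load-bearing.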

Load-bearing premise

The expert-defined criteria used to build the rubric embeddings correctly capture the true underlying construct and do not themselves introduce new biases or proxy signals.

What would settle it

Re-running the master's program experiment with rubric embeddings and observing no reduction in group disparities relative to standard embedding models would count against the paper's central empirical claim.
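
A check of this kind needs a concrete summary of an admitted class. The sketch below is one possible version, assuming that "disparity" is read as the share of one group among the top fraction of applicants and "cohort quality" as their mean true score; both definitions, the function name, and the toy data are hypothetical.

```python
import numpy as np

def admitted_class_summary(pred, true_score, group, frac=0.2):
    """Admit the top `frac` by predicted score; report group share and mean true score."""
    pred, true_score, group = map(np.asarray, (pred, true_score, group))
    k = max(1, int(round(frac * len(pred))))
    admitted = np.argsort(-pred, kind="stable")[:k]   # indices of the k highest predictions
    return {
        "n_admitted": k,
        "share_group_1": float(group[admitted].mean()),
        "mean_true_score": float(true_score[admitted].mean()),
    }

# Toy data, invented for illustration only.
true_score = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0], dtype=float)
group = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
biased_pred = true_score + 3.5 * group      # predictions that favor group 1

by_truth = admitted_class_summary(true_score, true_score, group)
by_biased = admitted_class_summary(biased_pred, true_score, group)
print(by_truth)
print(by_biased)
```

Comparing these summaries for a rubric embedding model and a standard embedding model on the same held-out applications is the comparison the test above calls for.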

Figures

Figures reproduced from arXiv: 2605.21455 by Calvin Isley, Johann D. Gaebler, Sharad Goel.

**Figure 1.** Bias of regression models trained on proxy labels Y′ and consequences for admitted classes. The x-axis in all panels denotes male advantage b in Eq. (1). Dark and light shaded regions show pointwise 68% and 95% confidence intervals (not visible in center and right-hand panels). Solid black lines show zero bias (left) or corresponding values for the top 20% of students ranked by actual score Y (center, rig…

**Figure 2.** Bias of models trained on proxy labels Y′ under different bias mitigation techniques and consequences for admitted classes. The x-axis in all panels denotes male advantage b in Eq. (1). Colors indicate the bias mitigation technique. Dark and light shaded regions show pointwise 68% and 95% confidence intervals. Solid black lines show zero bias (left) or corresponding values for the top 20% of students rank…

**Figure 3.** Theoretical explanation and empirical verification of rubric embedding models' superior predictive performance. Left: A causal DAG representing our admissions setting. The dotted line indicates correlation. Center: RMSEs of rubric embedding, black-box embedding, and kitchen sink models, evaluated relative to proxy labels Y′. Right: RMSEs of rubric embedding, black-box embedding, and kitchen sink models, …

Original abstract

Statistical decision algorithms are increasingly deployed in domains where ground-truth labels are hard to obtain, such as hiring, university admissions, and content moderation. In these settings, models are typically trained on historical human evaluations -- for example, using past hiring decisions as a proxy for true applicant quality. However, if past evaluations unjustly favor certain groups, models trained on these labels may inherit those biases. To address this problem, we propose basing predictions on rubric embeddings, a representation framework that replaces standard black-box embeddings with features derived from expert-defined criteria that align with the underlying construct of interest. By anchoring predictions to semantically meaningful dimensions, this approach guards against biased proxy signals. We provide both theoretical and empirical evidence that rubric embeddings mitigate label bias under plausible conditions. Empirically, we evaluate our method on a novel dataset of applications to a large master's program. We find that models trained on rubric embeddings reduce group disparities while improving measures of cohort quality. Our results suggest that basing predictions on interpretable, domain-grounded representations offers a practical approach to learning in the presence of biased labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes rubric embeddings—representations derived from expert-defined criteria aligned with the underlying construct of interest—as a replacement for standard black-box embeddings in supervised learning. The goal is to mitigate label bias arising from historical human evaluations (e.g., past admissions decisions) in high-stakes domains. The authors supply theoretical arguments under plausible conditions and empirical results on a novel master's program admissions dataset, claiming that models trained on these embeddings reduce group disparities while improving cohort quality measures.

Significance. If the central claims hold after addressing validation gaps, the work offers a practical, interpretable alternative for learning from biased labels in decision-making settings. Grounding predictions in domain-expert rubrics rather than proxy signals could improve both fairness and utility; the real-world admissions dataset is a positive feature. However, the approach's value depends critically on whether the rubrics themselves avoid embedding new demographic proxies, a point that requires stronger empirical grounding to elevate the contribution beyond an interesting heuristic.

major comments (3)

[§4.1–4.3] Rubric Construction and Dataset: The central claim that rubric embeddings mitigate label bias rather than merely substituting one set of potentially biased features for another requires explicit validation that the expert criteria are independent of historical label biases and protected attributes. The manuscript describes the rubric elicitation process but does not report cross-validation against an independent, unbiased quality measure (e.g., post-admission GPA or graduation rates). This omission is load-bearing because any correlation between rubric dimensions and group attributes would undermine the reported disparity reductions.
[§3] Theoretical Analysis: The theoretical support for bias mitigation is stated under 'plausible conditions,' yet the formal assumptions about rubric feature independence from protected attributes and from the original biased labels are not fully axiomatized. Without a clear statement of the conditions (e.g., a lemma bounding the bias term when rubric features are uncorrelated with group membership), it is difficult to assess whether the derivations are independent of the very proxy signals the method aims to avoid.
[Table 3 / Results] Cohort Quality Metrics: The reported improvements in cohort quality and disparity reduction are presented without ablation on rubric dimensionality or expert agreement rates. If the gains largely disappear when rubric features are replaced by random but semantically plausible dimensions, the advantage would be attributable to the specific expert criteria rather than the embedding framework itself; this comparison is needed to support the method's generality.

minor comments (2)

[§2.2] Notation: The definition of rubric embedding vectors could be clarified with an explicit equation showing how expert scores are aggregated into the final feature vector, to avoid ambiguity when readers compare against standard embedding baselines.
[Abstract] The phrase 'theoretical and empirical evidence' would be strengthened by a one-sentence summary of the key assumption or dataset scale.
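
For concreteness, one form the equation requested in the §2.2 comment could take is sketched below. It is an assumed illustration, not the paper's definition: $s_{jk}(x)$ denotes a hypothetical score given by rater $j$ on criterion $k$ for application $x$, with $J$ raters and $K$ criteria.

```latex
\[
\phi(x) = \bigl(\bar{s}_1(x), \dots, \bar{s}_K(x)\bigr) \in \mathbb{R}^K,
\qquad
\bar{s}_k(x) = \frac{1}{J} \sum_{j=1}^{J} s_{jk}(x).
\]
```

A downstream model would then take $\phi(x)$ as its input in place of a learned embedding of the raw application text. Other aggregation rules (medians, weighted means, a single automated scorer) would fit the same template.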

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments on our manuscript. We have carefully considered each point and provide point-by-point responses below. Revisions have been made to address the concerns where possible.

Referee: [§4.1–4.3] Rubric Construction and Dataset: The central claim that rubric embeddings mitigate label bias rather than merely substituting one set of potentially biased features for another requires explicit validation that the expert criteria are independent of historical label biases and protected attributes. The manuscript describes the rubric elicitation process but does not report cross-validation against an independent, unbiased quality measure (e.g., post-admission GPA or graduation rates). This omission is load-bearing because any correlation between rubric dimensions and group attributes would undermine the reported disparity reductions.

Authors: We agree that demonstrating the independence of the rubric criteria from historical biases and protected attributes is essential for the validity of our claims. In the revised manuscript, we have expanded the description of the rubric construction process in Section 4.1 to include more details on how the expert panel was selected and instructed to prioritize criteria aligned with academic potential rather than demographic proxies. We also report inter-rater agreement statistics to support the reliability of the rubrics. However, our dataset consists solely of application materials and does not include post-admission outcomes such as GPA or graduation rates. We have added a discussion of this limitation in the revised paper and suggest that future work could validate against such measures if longitudinal data becomes available.

Revision: partial

Referee: [§3] Theoretical Analysis: The theoretical support for bias mitigation is stated under 'plausible conditions,' yet the formal assumptions about rubric feature independence from protected attributes and from the original biased labels are not fully axiomatized. Without a clear statement of the conditions (e.g., a lemma bounding the bias term when rubric features are uncorrelated with group membership), it is difficult to assess whether the derivations are independent of the very proxy signals the method aims to avoid.

Authors: We appreciate this suggestion for strengthening the theoretical section. In the revised manuscript, we have added a new lemma in Section 3 that formally bounds the bias term under the assumption that the rubric features are uncorrelated with protected attributes. We have also explicitly listed all assumptions in a dedicated subsection, including the independence from the original biased labels, to make the conditions under which our bias mitigation holds clearer.

Revision: yes

Referee: [Table 3 / Results] Cohort Quality Metrics: The reported improvements in cohort quality and disparity reduction are presented without ablation on rubric dimensionality or expert agreement rates. If the gains largely disappear when rubric features are replaced by random but semantically plausible dimensions, the advantage would be attributable to the specific expert criteria rather than the embedding framework itself; this comparison is needed to support the method's generality.

Authors: To address the concern about whether the benefits stem from the specific expert criteria or the embedding framework, we have added ablation studies in the revised results section. These include varying the rubric dimensionality and reporting the effects on performance. Additionally, we compare against a baseline using random but semantically plausible dimensions generated from similar expert-like criteria. The results show that the improvements in cohort quality and disparity reduction are maintained with the expert-defined rubrics but diminish with random dimensions, supporting the importance of the domain-grounded criteria. We have also included expert agreement rates in the appendix.

Revision: yes
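
The responses mention inter-rater agreement statistics without saying which statistic is reported. As one common choice, Cohen's kappa for two raters scoring the same items on a categorical scale can be computed as in the sketch below; the ratings are invented and the function name is hypothetical.

```python
import numpy as np

def cohens_kappa(rater_a, rater_b) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    if a.shape != b.shape or a.size == 0:
        raise ValueError("ratings must be non-empty and aligned")
    labels = np.union1d(a, b)
    p_observed = float(np.mean(a == b))
    # Expected agreement if the raters scored independently with their own marginals.
    p_expected = float(sum(np.mean(a == k) * np.mean(b == k) for k in labels))
    if p_expected == 1.0:
        return 1.0   # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)

# Invented rubric scores from two raters on four applications.
scores_a = [1, 2, 3, 1]
scores_b = [1, 2, 3, 2]
kappa = cohens_kappa(scores_a, scores_b)
print(round(kappa, 3))
```

For ordinal rubric scales a weighted kappa, or an intraclass correlation when there are more than two raters, may be more appropriate; the unweighted version is shown only because it is the simplest to verify by hand.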

standing simulated objections not resolved

Full cross-validation of rubric criteria against post-admission outcomes such as GPA or graduation rates, since the dataset does not contain longitudinal follow-up data.

Circularity Check

0 steps flagged

No circularity: derivation remains independent of inputs

The provided abstract and context describe a method using expert-defined rubric embeddings to mitigate label bias, with theoretical and empirical claims. No equations, derivations, or self-citations are exhibited that reduce predictions or uniqueness results to fitted parameters or prior author work by construction. The central premise relies on external expert criteria and a novel dataset, but without load-bearing self-referential steps or renamings of known results in the given text, the argument does not collapse into its own inputs. This is the expected honest non-finding for papers whose core claims rest on domain-grounded representations rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that expert rubrics capture the intended construct without bias and on the existence of a novel dataset whose properties are not detailed in the abstract.

axioms (1)

domain assumption: Expert-defined criteria align with the underlying construct of interest.
The abstract states that rubric embeddings are derived from criteria that align with the construct.

invented entities (1)

rubric embeddings (no independent evidence)
purpose: Replace black-box embeddings with interpretable features from expert criteria to guard against biased proxy signals.
New representation framework introduced in the abstract.

pith-pipeline@v0.9.0 · 5719 in / 1200 out tokens · 28674 ms · 2026-05-21T05:29:54.568747+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "basing predictions on rubric embeddings, a representation framework that replaces standard black-box embeddings with features derived from expert-defined criteria"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We provide both theoretical and empirical evidence that rubric embeddings mitigate label bias under plausible conditions."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 4 internal anchors

[1]

Designing equitable algorithms.Nature Computational Science, 3, 2023

Alex Chohlas-Wood, Madison Coots, Sharad Goel, and Julian Nyarko. Designing equitable algorithms.Nature Computational Science, 3, 2023

work page 2023
[2]

Equality of opportunity in supervised learning

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29:3315–3323, 2016

work page 2016
[3]

Fairness through awareness

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. InProceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226, 2012

work page 2012
[4]

Inherent trade-offs in the fair determination of risk scores

Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. InProceedings of Innovations in Theoretical Computer Science (ITCS), 2017

work page 2017
[5]

A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions

Alexandra Chouldechova, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. InConference on Fairness, Accountability and Transparency, pages 134–148, 2018

work page 2018
[6]

Counterfactual fairness

Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Advances in Neural Information Processing Systems, pages 4066–4076, 2017

work page 2017
[7]

Gaebler, Hamed Nilforoshan, Ravi Shroff, and Sharad Goel

Sam Corbett-Davies, Johann D. Gaebler, Hamed Nilforoshan, Ravi Shroff, and Sharad Goel. The Measure and Mismeasure of Fairness.Journal of Machine Learning Research, 24(312): 1–117, 2023. ISSN 1533-7928. URLhttp://jmlr.org/papers/v24/22-1511.html

work page 2023
[8]

Algorithmic decision making and the cost of fairness

Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. InProceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797–806, 2017

work page 2017
[9]

Dissecting racial bias in an algorithm used to manage the health of populations.Science, 366(6464):447–453, 2019

Ziad Obermeyer, Brian Powers, Christine V ogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations.Science, 366(6464):447–453, 2019

work page 2019
[10]

Identifying and Correcting Label Bias in Machine Learning

Heinrich Jiang and Ofir Nachum. Identifying and Correcting Label Bias in Machine Learning. InProceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, pages 702–712. PMLR, June 2020. URL https://proceedings.mlr.press/ v108/jiang20a.html

work page 2020
[11]

Fair Classification with Group-Dependent Label Noise

Jialu Wang, Yang Liu, and Caleb Levy. Fair Classification with Group-Dependent Label Noise. InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pages 526–536, New York, NY , USA, March 2021. Association for Computing Machinery. ISBN 978-1-4503-8309-7. doi: 10 .1145/3442188.3445915. URL https:// dl.acm.org/doi...

work page doi:10.1145/3442188.3445915 2021
[12]

Risk scores, label bias, and everything but the kitchen sink.Science Advances, 10(13):eadi8411, March 2024

Michael Zanger-Tishler, Julian Nyarko, and Sharad Goel. Risk scores, label bias, and everything but the kitchen sink.Science Advances, 10(13):eadi8411, March 2024. doi: 10.1126/sciadv.adi8411. URL https://www.science.org/doi/10.1126/sciadv.adi8411

work page doi:10.1126/sciadv.adi8411 2024
[13]

Concept Bottleneck Models

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept Bottleneck Models. InProceedings of the 37th International Conference on Machine Learning, pages 5338–5348. PMLR, November 2020. URL https: //proceedings.mlr.press/v119/koh20a.html

work page 2020
[14]

Interpreting Pretrained Language Models via Concept Bottlenecks (Extended Abstract)

Zhen Tan, Lu Cheng, Song Wang, Yuan Bo, Jundong Li, and Huan Liu. Interpreting Pretrained Language Models via Concept Bottlenecks (Extended Abstract). volume 12, pages 10942– 10946, September 2025. doi: 10 .24963/ijcai.2025/1221. URL https://www.ijcai.org/ proceedings/2025/1221

work page 2025
[15]

arXiv preprint arXiv:2412.07992 , year=

Chung-En Sun, Tuomas Oikarinen, Berk Ustun, and Tsui-Wei Weng. Concept Bottleneck Large Language Models, September 2025. URL http://arxiv.org/abs/2412.07992 . arXiv:2412.07992 [cs]. 11

work page arXiv 2025
[16]

Interpretable-by-design text understanding with iteratively generated concept bottleneck.arXiv preprint arXiv:2310.19660, 2024

Josh Magnus Ludan, Qing Lyu, Yue Yang, Liam Dugan, Mark Yatskar, and Chris Callison-Burch. Interpretable-by-Design Text Understanding with Iteratively Generated Concept Bottleneck, April 2024. URLhttp://arxiv.org/abs/2310.19660. arXiv:2310.19660 [cs]

work page arXiv 2024
[17]

Interpretable user satisfaction estimation for conversational systems with large language models

Ying-Chun Lin, Jennifer Neville, Jack Stokes, Longqi Yang, Tara Safavi, Mengting Wan, Scott Counts, Siddharth Suri, Reid Andersen, Xiaofeng Xu, Deepak Gupta, Sujay Kumar Jauhar, Xia Song, Georg Buscher, Saurabh Tiwary, Brent Hecht, and Jaime Teevan. Interpretable user satisfaction estimation for conversational systems with large language models. In Lun-We...

work page doi:10.18653/v1/2024.acl-long.598 2024
[18]

LLM-based feature generation from text for interpretable machine learning, 2024

V ojtˇech Balek, Gustav Sourek, and Tomáš Kliegr. LLM-based feature generation from text for interpretable machine learning, 2024. URLhttps://arxiv.org/abs/2409.07132

work page arXiv 2024
[19]

LLMs can construct powerful representations and streamline sample-efficient supervised learning, 2026

Ilker Demirel, Lawrence Shi, Zeshan Hussain, and David Sontag. LLMs can construct powerful representations and streamline sample-efficient supervised learning, 2026. URL https:// arxiv.org/abs/2603.11679

work page internal anchor Pith review arXiv 2026
[20]

Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models, October 2023

An Yan, Yu Wang, Yiwu Zhong, Zexue He, Petros Karypis, Zihan Wang, Chengyu Dong, Amilcare Gentili, Chun-Nan Hsu, Jingbo Shang, and Julian McAuley. Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models, October 2023. URL http:// arxiv.org/abs/2310.03182. arXiv:2310.03182 [cs]

work page arXiv 2023
[21]

Adaptive concept bottleneck for foundation models under distribution shifts, 2024

Jihye Choi, Jayaram Raghuram, Yixuan Li, and Somesh Jha. Adaptive concept bottleneck for foundation models under distribution shifts, 2024. URL https://arxiv.org/abs/ 2412.14097

work page arXiv 2024
[22]

Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT- networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992,...

work page doi:10.18653/v1/d19-1410 2019
[23]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volu...

work page 2019
[24]

Text and Code Embeddings by Contrastive Pre-Training

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. Text and code embeddings by contrastive pre-training.arXiv preprint arXiv:2201.10005, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

Gaebler, Sharad Goel, Aziz Huq, and Prasanna Tambe

Johann D. Gaebler, Sharad Goel, Aziz Huq, and Prasanna Tambe. Auditing large language models for race & gender disparities: Implications for artificial intelligence–based hiring. Behavioral Science & Policy, 2025. doi: 10.1177/23794607251320229

work page doi:10.1177/23794607251320229 2025
[26]

Field experiments on discrimination.Handbook of economic field experiments, 1:309–393, 2017

Marianne Bertrand and Esther Duflo. Field experiments on discrimination.Handbook of economic field experiments, 1:309–393, 2017

work page 2017
[27]

how” and “why

S Michael Gaddis. Understanding the “how” and “why” aspects of racial-ethnic discrimination: A multimethod approach to audit studies.Sociology of Race and Ethnicity, 5(4):443–455, 2019

work page 2019
[28]

Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination.American Economic Review, 94(4):991–1013, 2004

Marianne Bertrand and Sendhil Mullainathan. Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination.American Economic Review, 94(4):991–1013, 2004. 12

work page 2004
[29]

Insight - Amazon scraps secret AI recruiting tool that showed bias against women

Jeffrey Dastin. Insight - Amazon scraps secret AI recruiting tool that showed bias against women. Reuters, October 2018. URL https://www.reuters.com/article/world/insight- amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women- idUSKCN1MK0AG/

work page 2018
[30]

Azure AI Document Intelligence, 2024

Microsoft. Azure AI Document Intelligence, 2024. URL https://azure.microsoft.com/ en-us/products/ai-services/ai-document-intelligence

work page 2024
[31]

OpenAI API

OpenAI. OpenAI API. https://platform.openai.com/docs/guides/embeddings, 2023. Accessed on October 14, 2024

work page 2023
[32]

Ma- tryoshka representation learning.Advances in Neural Information Processing Systems, 35: 30233–30249, 2022

Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Ma- tryoshka representation learning.Advances in Neural Information Processing Systems, 35: 30233–30249, 2022

work page 2022
[33]

SFFA v. Harvard. Students for Fair Admissions, Inc., Petitioner, v. President and Fellows of Harvard College. Students for Fair Admissions, Inc., Petitioner, v. University of North Carolina, et al., 2023.https://www.supremecourt.gov/opinions/22pdf/20-1199_l6gn.pdf

work page 2023
[34]

Kenneth Tay, Balasubramanian Narasimhan, and Trevor Hastie

J. Kenneth Tay, Balasubramanian Narasimhan, and Trevor Hastie. Elastic net regularization paths for all generalized linear models.Journal of Statistical Software, 106(1):1–31, 2023. doi: 10.18637/jss.v106.i01

work page doi:10.18637/jss.v106.i01 2023
[35]

John Wiley & Sons, New York, 1987

Donald B Rubin.Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York, 1987

work page 1987
[36]

Man is to computer programmer as woman is to homemaker? debiasing word embeddings.Advances in neural information processing systems, 29, 2016

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings.Advances in neural information processing systems, 29, 2016

work page 2016
[37]

Null it out: Guarding protected attributes by iterative nullspace projection

Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 7237–7256, 2020

work page 2020
[38]

Censoring representations with an adversary

Harrison Edwards and Amos Storkey. Censoring representations with an adversary. InProceed- ings of the International Conference in Learning Representations, 2016

work page 2016
[39]

Mitigating unwanted biases with adversarial learning

Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. InProceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pages 335–340, 2018

work page 2018
[40]

Fairfax County School Board

Coalition for TJ v. Fairfax County School Board. Coalition for TJ v. Fairfax County School Board, 2023. 68 F.4th 864 (4th Cir. 2023). https://law.justia.com/cases/federal/ appellate-courts/ca4/22-1280/22-1280-2023-05-23.html

work page 2023
[41]

Boston School Committee

Boston Parent Coalition v. Boston School Committee. Boston Parent Coalition for Aca- demic Excellence Corp. v. School Committee for the City of Boston, 2023. 89 F.4th 46 (1st Cir. 2023). https://law.justia.com/cases/federal/appellate-courts/ca1/21- 1303/21-1303-2023-12-19.html

work page 2023
[42]

Unsupervised elicitation of language models, 2025

Jiaxin Wen, Zachary Ankner, Arushi Somani, Peter Hase, Samuel Marks, Jacob Goldman- Wetzler, Linda Petrini, Henry Sleight, Collin Burns, He He, Shi Feng, Ethan Perez, and Jan Leike. Unsupervised elicitation of language models, 2025. URL https://arxiv.org/abs/ 2506.10139

work page arXiv 2025
[43]

Gendered language in resumes and its implications for algorithmic bias in hiring

Prasanna Parasurama and João Sedoc. Gendered language in resumes and its implications for algorithmic bias in hiring. InProceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP), Seattle, Washington, July 2022. Association for Computa- tional Linguistics. doi: 10 .18653/v1/2022.gebnlp-1.7. URL https://aclanthology.org/ 2022.ge...

work page 2022
[44]

Equal Protection Under Algorithms: A New Statistical and Legal Framework.Michigan Law Review, 119(2):291–396, November 2020

Crystal Yang and Will Dobbie. Equal Protection Under Algorithms: A New Statistical and Legal Framework.Michigan Law Review, 119(2):291–396, November 2020. ISSN 0026-2234. doi: https://doi.org/10.36644/mlr.119.2.equal. URL https://repository.law.umich.edu/ mlr/vol119/iss2/3

[45] Devin G. Pope and Justin R. Sydnor. Implementing Anti-discrimination Policies in Statistical Profiling Models. American Economic Journal: Economic Policy, 3(3):206–231, August 2011. ISSN 1945-7731. doi: 10.1257/pol.3.3.206. URL https://www.aeaweb.org/articles?id=10.1257/pol.3.3.206

[46] Yixin Wang, Dhanya Sridhar, and David M Blei. Equal opportunity and affirmative action via counterfactual predictions. arXiv preprint arXiv:1905.10870, 2019

[47] OpenAI. GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

[48] Madison Coots, Soroush Saghafian, David Kent, and Sharad Goel. Reevaluating the role of race and ethnicity in diabetes screening. arXiv preprint arXiv:2306.10220, 2023

[49] OpenAI. GPT-5.4 Thinking system card. https://deploymentsafety.openai.com/gpt-5-4-thinking, 2026. Includes addendum on GPT-5.4 mini, March 17, 2026.

A Mathematical Appendix

Conventions and non-degeneracy assumptions. For an integrable random variable A and random object W, we write E[A|W=w] for a chosen measurable version of the conditional expectatio...
