A Conceptual Framework for Evaluating Fairness in Search

Anubrata Das; Matthew Lease

arxiv: 1907.09328 · v1 · pith:BESUJ7NYnew · submitted 2019-07-22 · 💻 cs.IR

Pith reviewed 2026-05-24 17:54 UTC · model grok-4.3

classification 💻 cs.IR

keywords: distributional fairness · search evaluation · fairness axioms · TREC collections · relevance metrics · metric interpolation · information retrieval

The pith

A conceptual framework evaluates search fairness via axioms for distributional fairness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines distributional fairness as a property of search result distributions and builds a conceptual framework around it. It formulates axioms that any ideal fairness evaluation framework must satisfy. Existing TREC collections are shown to be reusable for fairness studies once data bias is measured. Analyses demonstrate divergence between relevance and fairness metrics, and a simple interpolation combines the two into one score. A sympathetic reader would care because search systems have long optimized only for relevance, and this supplies a principled way to incorporate fairness without discarding prior evaluation practices.

Core claim

We define a notion of distributional fairness and provide a conceptual framework for evaluating search results based on it. As part of this, we formulate a set of axioms which an ideal evaluation framework should satisfy for distributional fairness. We show how existing TREC test collections can be repurposed to study fairness, measure potential data bias to inform test collection design, demonstrate metric divergence between relevance and fairness, and describe a simple but flexible interpolation strategy for integrating relevance and fairness into a single metric.
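
To make the notion concrete, the sketch below scores a ranked list by how closely the group distribution of its top-k results matches a target distribution. It is an illustration only, not the paper's metric: the use of Jensen-Shannon divergence, the cutoff `k`, the function names, and the group labels are all assumptions introduced here. The two target choices (uniform and dataset) follow the Figure 2 caption.

```python
import math
from collections import Counter


def group_distribution(labels, groups):
    """Share of each group among the given labels, in the order of `groups`."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return [counts[g] / len(labels) for g in groups]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distributional_fairness(ranked_labels, target, groups, k=10):
    """Hypothetical score: 1 minus the divergence between the observed
    top-k group distribution and the target distribution."""
    observed = group_distribution(ranked_labels[:k], groups)
    return 1.0 - js_divergence(observed, target)


groups = ["A", "B"]                      # assumed group labels
uniform_target = [0.5, 0.5]              # uniform target distribution
collection = ["A"] * 8 + ["B"] * 2       # toy collection, invented for illustration
dataset_target = group_distribution(collection, groups)

balanced = ["A", "B", "A", "B"]
skewed = ["A", "A", "A", "A"]
score_balanced = distributional_fairness(balanced, uniform_target, groups, k=4)
score_skewed = distributional_fairness(skewed, uniform_target, groups, k=4)
```

Under the uniform target the balanced list scores higher than the skewed one. Swapping in `dataset_target` changes what counts as fair, which is why measuring the collection's own group distribution matters before reusing it.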

What carries the argument

The set of axioms that an ideal distributional fairness evaluation framework must satisfy, around which the conceptual framework is constructed.

If this is right

Fairness metrics diverge from relevance metrics on real collections, requiring explicit trade-off handling.
An interpolation strategy produces a single metric usable for both optimization and evaluation.
Repurposed TREC collections become viable for fairness studies after bias quantification.
Test collection design for fair search can be guided by measured data bias levels.
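
The trade-off in the points above can be sketched with a weighted combination of the two scores. The source describes the interpolation only as "simple but flexible", so the linear form, the weight name `lam`, and the system scores below are assumptions made for illustration.

```python
def interpolate(relevance, fairness, lam=0.5):
    """Assumed linear interpolation: lam weights relevance, (1 - lam) weights fairness."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * relevance + (1.0 - lam) * fairness


def rank_systems(systems, lam):
    """Order system names by interpolated score, best first."""
    return sorted(
        systems,
        key=lambda name: interpolate(*systems[name], lam),
        reverse=True,
    )


# Hypothetical (relevance, fairness) scores for two systems.
systems = {"sys_a": (0.80, 0.40), "sys_b": (0.60, 0.90)}

relevance_only = rank_systems(systems, lam=1.0)   # sys_a ranks first
fairness_only = rank_systems(systems, lam=0.0)    # sys_b ranks first
balanced = rank_systems(systems, lam=0.5)         # sys_b ranks first
```

Because the two metrics disagree about these systems, the ordering flips as `lam` moves from 1 to 0. A single interpolated score makes that choice explicit rather than hiding it.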

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same axiomatic approach could be tested on fairness in recommendation or question-answering systems.
Future collections might be built from the start to satisfy the axioms rather than retrofitted.
The framework offers a template for defining fairness axioms in other ranked-output domains.

Load-bearing premise

The axioms correctly capture what an ideal distributional fairness evaluation framework must satisfy, and repurposing existing TREC collections introduces no critical new biases.

What would settle it

A concrete case in which search results judged fair by external criteria violate one or more of the stated axioms, or in which the interpolated metric produces rankings that are worse on both relevance and fairness than optimizing the two separately.

Figures

Figures reproduced from arXiv: 1907.09328 by Anubrata Das, Matthew Lease.

**Figure 2.** Correlation in system scores by metrics for relevance vs. fairness (for uniform vs. dataset target distributions). [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]
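
A comparison of this kind can be sketched as a rank correlation between per-system scores under the two metrics. The caption does not say which correlation coefficient the paper uses, so Kendall's tau (the tau-a variant, which ignores tie correction) is an assumption here, and the scores are invented for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-system scores, one entry per system.
relevance_scores = [0.8, 0.6, 0.7, 0.5]
fairness_scores = [0.4, 0.9, 0.5, 0.8]

tau = kendall_tau(relevance_scores, fairness_scores)
```

A tau near 1 would mean the two metrics rank systems alike. A value near 0 or below indicates the divergence the paper reports, where a system that does well on relevance need not do well on fairness.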

Original abstract

While search efficacy has been evaluated traditionally on the basis of result relevance, fairness of search has attracted recent attention. In this work, we define a notion of distributional fairness and provide a conceptual framework for evaluating search results based on it. As part of this, we formulate a set of axioms which an ideal evaluation framework should satisfy for distributional fairness. We show how existing TREC test collections can be repurposed to study fairness, and we measure potential data bias to inform test collection design for fair search. A set of analyses show metric divergence between relevance and fairness, and we describe a simple but flexible interpolation strategy for integrating relevance and fairness into a single metric for optimization and evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper defines distributional fairness with axioms and an interpolation method for blending it with relevance, which is a useful conceptual step for IR fairness work but stays definitional without strong validation.

The core new piece is the distributional fairness notion plus a set of axioms that any evaluation framework should meet, along with a practical interpolation between fairness and relevance scores. They also show how to reuse TREC collections for fairness studies and flag data bias issues in those collections. The analyses on metric divergence between relevance and fairness give a clear sense of why separate handling matters.

This is solid as a starting structure for people who need to evaluate both aspects together in search systems. It is honest about the repurposing approach and does not overclaim empirical results. The axioms and interpolation are presented as flexible tools rather than finished solutions, which keeps the contribution proportionate.

The main limitation is that the work is almost entirely conceptual. The axioms are motivated but not tested against user judgments or real deployment outcomes, so it is unclear how well they capture what fairness should mean in practice. The TREC bias measurements are noted but the paper does not quantify how much those biases affect downstream fairness conclusions. No concrete numbers on how the interpolation changes system rankings or improves joint optimization appear in the provided material.

This is aimed at IR researchers already working on fairness metrics who want a shared language and evaluation skeleton. A reader building new fairness measures or running experiments on existing test collections would find usable ideas here. The framework is coherent on its own terms and engages the literature without circularity, so it deserves a serious referee even though revisions would likely need added experiments or axiom validation.

Referee Report

2 major / 2 minor

Summary. The paper defines a notion of distributional fairness for search results and presents a conceptual framework for evaluating search systems with respect to it. It formulates a set of axioms that any ideal evaluation framework for distributional fairness should satisfy, demonstrates how existing TREC test collections can be repurposed for fairness studies while quantifying associated data biases, reports analyses showing divergence between relevance-based and fairness-based metrics, and describes a flexible interpolation strategy for combining relevance and fairness into a single optimization metric.

Significance. If the proposed axioms and framework gain acceptance, the work could provide a useful foundation for standardizing fairness evaluation in information retrieval, moving beyond ad-hoc fairness measures. The practical elements—repurposing of TREC collections with bias measurement and the interpolation approach—are concrete contributions that could aid adoption. The paper is explicitly conceptual rather than empirical or axiomatic-derivational, so its value lies in the clarity and utility of the proposed definitions and strategy.

major comments (2)

[Axioms formulation section] The central contribution rests on the set of axioms for an ideal distributional fairness framework, yet the manuscript provides no formal argument, completeness proof, or comparison showing why these particular axioms (as opposed to alternatives) are necessary and sufficient; without this, the framework's status as 'ideal' remains a definitional choice rather than a derived property.
[TREC repurposing and bias measurement section] The repurposing of TREC collections for fairness analysis includes a data-bias measurement, but the manuscript does not quantify how large the measured bias must be before it invalidates downstream fairness conclusions or provide a mitigation strategy; this directly affects the claim that the collections can be reliably used for fairness studies.

minor comments (2)

[Interpolation strategy section] Notation for the interpolation strategy should be introduced with an explicit equation rather than described only in prose, to allow readers to reproduce the combined metric exactly.
[Introduction] The abstract and introduction use 'distributional fairness' without an immediate formal definition; a one-sentence mathematical characterization early in the paper would improve readability.
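
As an illustration of what such an equation and characterization could look like, the block below gives one possible notation. Every symbol is an assumption introduced here rather than the paper's own: $P_\pi$ is the group distribution observed in the results of ranking $\pi$, $P^{*}$ is the target distribution, $D$ is some divergence bounded in $[0, 1]$, $R$ is a relevance score in $[0, 1]$, and $\lambda$ is the interpolation weight. The linear form is likewise assumed.

```latex
\begin{align}
  F(\pi) &= 1 - D\left(P_\pi \,\|\, P^{*}\right), \\
  M_\lambda(\pi) &= \lambda\, R(\pi) + (1 - \lambda)\, F(\pi), \qquad \lambda \in [0, 1].
\end{align}
```

With $\lambda = 1$ the combined score reduces to relevance alone, and with $\lambda = 0$ to fairness alone.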

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. Below we respond point by point to the major comments.

Referee: [Axioms formulation section] The central contribution rests on the set of axioms for an ideal distributional fairness framework, yet the manuscript provides no formal argument, completeness proof, or comparison showing why these particular axioms (as opposed to alternatives) are necessary and sufficient; without this, the framework's status as 'ideal' remains a definitional choice rather than a derived property.

Authors: The paper is explicitly positioned as conceptual rather than axiomatic-derivational. The axioms are offered as an initial set of desirable properties motivated by the requirements of distributional fairness in search, not as a formally proven minimal or complete basis. We agree that necessity, sufficiency, and comparisons to alternatives are not established here. We will revise the axioms section to state this scope explicitly and to frame the axioms as a proposal open to refinement by the community.
revision: partial

Referee: [TREC repurposing and bias measurement section] The repurposing of TREC collections for fairness analysis includes a data-bias measurement, but the manuscript does not quantify how large the measured bias must be before it invalidates downstream fairness conclusions or provide a mitigation strategy; this directly affects the claim that the collections can be reliably used for fairness studies.

Authors: The referee correctly notes the absence of a specific bias threshold or mitigation strategy. The bias quantification is presented to inform users of the repurposed collections rather than to certify them as suitable without qualification. We will revise the relevant section to state this limitation clearly and to identify the development of such thresholds and strategies as an open research question.
revision: yes

Circularity Check

0 steps flagged

Conceptual proposal with no circular derivation chain

The paper introduces a definition of distributional fairness, formulates axioms that an ideal framework should satisfy, demonstrates repurposing of existing TREC collections, measures associated data bias, and describes an interpolation between relevance and fairness metrics. These steps are presented as definitional and conceptual contributions rather than empirical predictions or derivations from first principles. No equations reduce outputs to fitted inputs by construction, no self-citations serve as load-bearing uniqueness theorems, and the work remains self-contained against external benchmarks without renaming known results or smuggling ansatzes. The reader's assessment of score 2 aligns with minor self-citation potential that is not load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests primarily on the appropriateness of the newly formulated axioms for distributional fairness and on the validity of repurposing TREC collections; no free parameters are mentioned and the distributional fairness notion is the main invented entity.

axioms (1)

domain assumption: A set of axioms exists that any ideal evaluation framework for distributional fairness in search must satisfy.
The paper formulates these axioms as the foundation of the proposed framework.

invented entities (1)

Distributional fairness · no independent evidence
purpose: To provide a measurable notion of fairness based on result distribution across groups.
New concept introduced to ground the evaluation framework.

pith-pipeline@v0.9.0 · 5629 in / 1228 out tokens · 37125 ms · 2026-05-24T17:54:14.635123+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Software Fairness: An Analysis and Survey
cs.SE · 2022-05 · unverdicted · novelty 4.0

A literature survey of 164 papers on software fairness reveals gaps in requirements engineering, intersectional measures, unstructured data, and white-box ML methods.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1] Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, and Samuel Ieong. 2009. Diversifying search results. In Proceedings of the second ACM international conference on web search and data mining. ACM, 5–14.

[2] Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and WB Croft. 2018. Unbiased learning to rank with unbiased propensity estimation. arXiv:1804.05938 (2018).

[3] Asia J Biega, Krishna P Gummadi, and Gerhard Weikum. 2018. Equity of attention: Amortizing individual fairness in rankings. arXiv:1805.01788 (2018).

[4] L. Elisa Celis, Damian Straszak, and Nisheeth K. Vishnoi. 2018. Ranking with Fairness Constraints. In ICALP.

[5] Le Chen, Ruijun Ma, Anikó Hannák, and Christo Wilson. 2018. Investigating the impact of gender on rank in resume search engines. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 651.

[6] Michael D. Ekstrand, Robin Burke, and Fernando Diaz. 2019. Fairness and Discrimination in Retrieval and Recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'19). ACM, New York, NY, USA, 1403–1404. https://doi.org/10.1145/3331184.3331380

[7] Danielle Ensign, Sorelle A Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. 2017. Runaway feedback loops in predictive policing. arXiv preprint arXiv:1706.09847 (2017).

[8] Robert Epstein and Ronald E Robertson. 2015. The search engine manipulation effect (SEME) and its possible impact on the outcomes of elections. Proceedings of the National Academy of Sciences 112, 33 (2015), E4512–E4521.

[9] Matthew Lease. 2018. Fact Checking and Information Retrieval. (2018).

[10] Q Vera Liao and Wai-Tat Fu. 2013. Beyond the filter bubble: interactive effects of perceived threat and topic involvement on selective exposure to information. In Proceedings of CHI. ACM, 2359–2368.

[11] Christina Lioma, Jakob Grue Simonsen, and Birger Larsen. 2017. Evaluation measures for relevance and credibility in ranked lists. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval. ACM, 91–98.

[12] Craig MacDonald, Iadh Ounis, and Ian Soboroff. 2007. Overview of the TREC 2007 Blog Track. In TREC.

[13] Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. 2018. Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 2243–2251.

[14] Safiya Umoja Noble. 2018. Algorithms of oppression: How search engines reinforce racism. NYU Press.

[15] Piotr Sapiezynski, Wesley Zeng, Ronald E Robertson, Alan Mislove, and Christo Wilson. 2019. Quantifying the Impact of User Attention on Fair Group Representation in Ranked Lists. In Companion Proceedings of The 2019 World Wide Web Conference. ACM, 553–562.

[16] Ashudeep Singh and Thorsten Joachims. 2018. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2219–2228.

[17] Ellen M. Voorhees and Donna K. Harman. 1999. Overview of the Eighth Text REtrieval Conference (TREC-8). In TREC.

[18] Ke Yang and Julia Stoyanovich. 2017. Measuring Fairness in Ranked Outputs. In SSDBM.

[19] Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, and Ricardo A. Baeza-Yates. 2017. FA*IR: A Fair Top-k Ranking Algorithm. In CIKM.

[20] Meike Zehlike and Carlos Castillo. 2018. Reducing Disparate Exposure in Ranking: A Learning To Rank Approach. CoRR abs/1805.08716 (2018).
