arxiv: 2604.09946 · v1 · submitted 2026-04-10 · 💻 cs.CY · cs.IR

All Eyes on the Ranker: Participatory Auditing to Surface Blind Spots in Ranked Search Results

Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3

classification 💻 cs.CY cs.IR
keywords participatory auditing · ranked search results · user-perceived impacts · neural rankers · search engine accountability · epistemic impacts · representational harms · algorithmic transparency

The pith

Participatory workshops show users link ranked search results to epistemic and social harms but overlook manipulations when trusting neural models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that involving ordinary users in auditing search engines can surface their own causal explanations for how ranked results shape what people know, how groups are represented, and what infrastructural and social outcomes follow. Across three workshops, 21 participants used a custom interface to compare a traditional lexical ranker against a neural semantic one, adjust transparency settings, and review deliberately altered result lists, then reflected on the wider effects. This process produced a four-part taxonomy of perceived impacts and exposed accountability gaps such as missing pipeline visibility and recourse options. At the same time, the work reveals a built-in limit: when users view the neural model as competent, they extend trust that reduces their willingness to question the output, allowing the planted manipulations to pass unnoticed. The findings indicate that standard expert metrics alone leave these user-level understandings and vulnerabilities unexamined.

Core claim

Participatory auditing workshops using a custom interface across four tasks reveal that users construct causal narratives connecting properties of ranked search to epistemic, representational, infrastructural, and downstream social impacts, which the authors synthesise into a taxonomy of perceived effects. The same workshops also demonstrate that accumulated trust in a neural reranker can suppress critical scrutiny and allow intentionally manipulated rankings to remain undetected.

What carries the argument

The participatory auditing process itself: guided tasks on a custom search interface that compares a lexical ranker (BM25) with a neural reranker (MonoT5), varies transparency and user controls, inserts adversarial ranking changes, and prompts participants to articulate reflexive causal narratives.
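
As a rough illustration of what comparing two rankers can look like in practice, the sketch below computes the overlap between the top-k results of two result lists. This is not the paper's interface or method; the function name `overlap_at_k` and the document IDs are hypothetical, and overlap@k is only one of several possible comparison measures.

```python
def overlap_at_k(ranking_a, ranking_b, k):
    """Fraction of the top-k documents shared by two rankings (0.0 to 1.0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


# Hypothetical document IDs, not taken from the paper.
lexical_ranking = ["d1", "d2", "d3", "d4", "d5"]
neural_ranking = ["d3", "d1", "d6", "d2", "d7"]

# Two of the top three documents (d1 and d3) appear in both lists.
print(overlap_at_k(lexical_ranking, neural_ranking, 3))
```

A low overlap would suggest the two rankers expose users to noticeably different documents, which is the kind of difference a side-by-side workshop task could ask participants to reflect on.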

If this is right

  • Conventional model-centric or expert-only evaluations of search systems miss user-articulated impacts and accountability gaps that participatory methods can identify.
  • Designers should provide visibility into the full ranking pipeline and mechanisms for recourse when users perceive harms.
  • Neural semantic rankers may require additional safeguards precisely because their apparent competence can reduce user vigilance.
  • Participatory auditing complements rather than replaces technical audits by surfacing contextual and downstream effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed trust effect implies that participatory checks may need to be run in low-familiarity or deliberately skeptical settings to remain effective.
  • The taxonomy could be tested for stability by repeating the workshops with users who have different levels of prior exposure to search technology.
  • Extending the approach to other ranking systems, such as recommendation feeds or news aggregators, would clarify whether the same trust-related blind spots appear.

Load-bearing premise

The causal narratives and perceptions gathered from 21 workshop participants on a custom interface match how broader populations experience and judge real deployed search engines, and the tested adversarial manipulations stand in for realistic threats.

What would settle it

A follow-up study in which a larger, demographically varied group uses an unmodified commercial search engine over multiple sessions and is then shown equivalent ranking manipulations to measure whether detection rates remain low once trust has formed.
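
A follow-up of this kind would need to report detection rates with some measure of uncertainty. The sketch below shows one common option, the Wilson score interval for a proportion. The counts are hypothetical and are not results from the paper; `wilson_interval` is an illustrative helper, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 5 of 60 participants notice the manipulated ranking.
low, high = wilson_interval(5, 60)
print(f"detection rate 5/60, interval [{low:.3f}, {high:.3f}]")
```

The Wilson interval is used here rather than the simpler normal approximation because it behaves sensibly when the detection rate is near zero, which is the regime the paper's observation points to.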

Figures

Figures reproduced from arXiv: 2604.09946 by Anna Marie Rezk, Ayah Soufan, Craig Macdonald, Graham McDonald, Iadh Ounis, Patrizia Di Campli San Vito.

Figure 1: Simplified pipeline diagram for reflexive annotation activities to identify information and recourse needs of users.
Figure 2: Left: Screenshot of search interface of Task 1 with project branding (same interface also used for Tasks 2 and 4), right:
Figure 3: Annotated interface screenshot (left), search pipeline (middle), and matrix of impacts x dimensions of impact (right).

Original abstract

Search engines that present users with a ranked list of search results are a fundamental technology for providing public access to information. Evaluations of such systems are typically conducted by domain experts and focus on model-centric metrics, relevance judgments, or output-based analyses, rather than on how accountability, harm, or trust are experienced by users. This paper argues that participatory auditing is essential for revealing users' causal and contextual understandings of how ranked search results produce impacts, particularly as ranking models appear increasingly convincing and sophisticated in their semantic interpretation of user queries. We report on three participatory auditing workshops (n=21) in which participants engaged with a custom search interface across four tasks, comparing a lexical ranker (BM25) and a neural semantic reranker (MonoT5), exploring varying levels of transparency and user controls, and examining an intentionally adversarially manipulated ranking. Reflexive activities prompted participants to articulate causal narratives linking search system properties to broader impacts. Synthesising the findings, we contribute a taxonomy of user-perceived impacts of ranked search results, spanning epistemic, representational, infrastructural, and downstream social impacts. However, interactions with the neural model revealed limits to participatory auditing itself: perceived system competence and accumulated trust reduced critical scrutiny during the workshop, allowing manipulations to go undetected. Participants expressed desire for visibility into the full search pipeline and recourse mechanisms. Together, these findings show how participatory auditing can surface user perceived impacts and accountability gaps that remain unseen when relying on conventional audits, while revealing where participatory auditing may encounter limitations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript reports findings from three participatory auditing workshops (n=21) in which participants interacted with a custom search interface comparing a lexical ranker (BM25) and a neural reranker (MonoT5) across four tasks that varied transparency, user controls, and an intentionally adversarially manipulated ranking. Reflexive activities elicited causal narratives linking system properties to impacts; these are synthesized into a taxonomy spanning epistemic, representational, infrastructural, and downstream social categories. The authors additionally observe that perceived competence of the neural model reduced participants' detection of manipulations, revealing limits to participatory auditing itself, and report participant desires for full-pipeline visibility and recourse mechanisms. The central claim is that participatory auditing surfaces user-perceived impacts and accountability gaps missed by conventional expert or metric-based evaluations.

Significance. If the empirical grounding holds, the work contributes a concrete taxonomy and a cautionary finding on trust-induced blind spots that could usefully inform HCI and algorithmic-accountability research on ranking systems. The explicit comparison of lexical vs. neural rankers under controlled transparency conditions and the reflexive workshop design provide a reproducible template for future participatory audits. The observation that accumulated trust can suppress critical scrutiny is a non-obvious, actionable insight for audit protocol design. These elements strengthen the case for user-centered methods alongside model-centric ones, though the small, non-representative sample and unvalidated interface realism constrain immediate generalizability.

major comments (3)
  1. [§3 and §4] §3 (Methods) and §4 (Findings): The derivation of the four-category taxonomy from participant narratives is presented without description of the qualitative analysis procedure (e.g., coding scheme, number of coders, inter-rater reliability, or saturation criteria). Because the taxonomy is the primary empirical contribution, this omission makes it impossible to evaluate its internal validity or replicability.
  2. [§5 and abstract] §5 (Discussion) and abstract: The claim that 'perceived system competence and accumulated trust reduced critical scrutiny' and allowed manipulations to go undetected rests on observations from the n=21 workshops using a custom interface and deliberately constructed adversarial rankings. No evidence is provided that the interface reproduces commercial ranking pipelines, query distributions, or real user stakes, nor is any external validation (log analysis, larger survey, or comparison to deployed systems) reported; this assumption is load-bearing for the 'limits to participatory auditing' conclusion.
  3. [§4.2] §4.2 (adversarial task results): The paper asserts that the intentionally manipulated rankings constitute realistic adversarial scenarios, yet supplies no justification or comparison showing that the perturbations match plausible real-world attacks on BM25 or MonoT5. If this premise does not hold, the finding that accumulated trust allowed manipulations to pass undetected loses its empirical force.
minor comments (3)
  1. [§3.1] The participant recruitment and demographic details are only briefly summarized; expanding this subsection would help readers assess the scope of the 'user-perceived' claims.
  2. [§3.2] Several workshop task descriptions refer to 'varying levels of transparency' without a precise enumeration of the UI elements shown or hidden in each condition; a table or figure clarifying the transparency variants would improve reproducibility.
  3. [§2] The related-work section would benefit from explicit citations to recent participatory-auditing studies in HCI (e.g., on content moderation or recommendation systems) to better situate the novelty of the taxonomy.
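
Major comment 1 asks for inter-rater reliability on the taxonomy coding. As a minimal sketch of what such a check could look like, the code below computes Cohen's kappa for two coders who each assign one taxonomy category per excerpt. The labels are hypothetical examples, not data from the paper, and the paper does not state that kappa was used; `cohens_kappa` is an illustrative helper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each coder's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for six excerpts, using the paper's four category names.
coder_a = ["epistemic", "epistemic", "representational",
           "infrastructural", "social", "social"]
coder_b = ["epistemic", "representational", "representational",
           "infrastructural", "social", "epistemic"]

print(round(cohens_kappa(coder_a, coder_b), 3))
```

Kappa corrects raw agreement for agreement expected by chance. Reflexive thematic analysis does not always report such a statistic, so this is one possible response to the comment rather than a required one.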

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, outlining how we will strengthen the paper through revisions where appropriate.

  1. Referee: [§3 and §4] §3 (Methods) and §4 (Findings): The derivation of the four-category taxonomy from participant narratives is presented without description of the qualitative analysis procedure (e.g., coding scheme, number of coders, inter-rater reliability, or saturation criteria). Because the taxonomy is the primary empirical contribution, this omission makes it impossible to evaluate its internal validity or replicability.

    Authors: We agree that a more explicit account of the qualitative analysis is required for transparency and replicability. The taxonomy emerged from an iterative reflexive thematic analysis of participant narratives and reflexive activities. In the revised manuscript, we will expand the Methods section (§3) with a dedicated subsection describing the coding scheme (inductive codes grouped into the four impact categories), the involvement of two researchers in independent coding followed by consensus discussions, and the saturation criteria assessed through iterative review of new data against emerging themes. This addition will directly address concerns about internal validity.
    revision: yes

  2. Referee: [§5 and abstract] §5 (Discussion) and abstract: The claim that 'perceived system competence and accumulated trust reduced critical scrutiny' and allowed manipulations to go undetected rests on observations from the n=21 workshops using a custom interface and deliberately constructed adversarial rankings. No evidence is provided that the interface reproduces commercial ranking pipelines, query distributions, or real user stakes, nor is any external validation (log analysis, larger survey, or comparison to deployed systems) reported; this assumption is load-bearing for the 'limits to participatory auditing' conclusion.

    Authors: We acknowledge that the observation is situated within the controlled workshop setting and that broader claims about commercial systems would require additional validation we do not provide. We will revise the abstract and §5 to qualify the finding as an insight emerging from participant interactions in this specific participatory auditing protocol, framing it as a cautionary note on potential limitations of the method rather than a general claim about neural rankers. We will also expand the limitations discussion to explicitly note the absence of external validation and suggest directions for future comparative studies. This tempers the conclusion while preserving the empirical observation from the data.
    revision: partial

  3. Referee: [§4.2] §4.2 (adversarial task results): The paper asserts that the intentionally manipulated rankings constitute realistic adversarial scenarios, yet supplies no justification or comparison showing that the perturbations match plausible real-world attacks on BM25 or MonoT5. If this premise does not hold, the finding that accumulated trust allowed manipulations to pass undetected loses its empirical force.

    Authors: We accept that stronger justification for the adversarial design is needed. The manipulations were constructed to exploit documented vulnerabilities of lexical matching (e.g., term frequency manipulation) and neural semantic models (e.g., query drift via paraphrasing), drawing on prior IR literature on adversarial ranking. In the revision, we will augment §4.2 with an explicit rationale subsection that references relevant attack literature and clarifies that the scenarios function as illustrative probes to test participatory detection rather than exhaustive real-world attack simulations. This will better ground the contribution without overstating realism.
    revision: yes
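
The term frequency manipulation mentioned in the response above can be illustrated with a small BM25 scorer. This is a sketch under stated assumptions, not the paper's pipeline or its actual perturbations: the documents are invented, tokenisation is plain whitespace splitting, the IDF uses the common `log(1 + ...)` variant, and `k1=1.2`, `b=0.75` are conventional defaults rather than values taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    query_terms = query.lower().split()
    # Document frequency: number of documents containing each query term.
    doc_freq = {t: sum(1 for tokens in tokenized if t in tokens)
                for t in set(query_terms)}
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for t in query_terms:
            tf = term_freq[t]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Hypothetical documents; the second one is off-topic for the query.
docs = [
    "chemist studies reactions in the lab",
    "a pharmacy sells medicine",
    "famous chemist biography and awards",
]
before = bm25_scores("chemist", docs)

# Term stuffing: append the query term to the off-topic document.
stuffed = list(docs)
stuffed[1] = docs[1] + " chemist chemist chemist"
after = bm25_scores("chemist", stuffed)

print("before:", before)
print("after: ", after)
```

In this toy corpus the off-topic document scores zero before stuffing and highest afterwards, which is the kind of lexical vulnerability the authors say their probes were built around. Attacks on neural rerankers work differently and are not covered by this sketch.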

Circularity Check

0 steps flagged

No circularity: qualitative empirical synthesis from participant data

The paper reports three workshops (n=21) using a custom interface to elicit causal narratives from participants comparing BM25 and MonoT5 rankers, with reflexive activities leading to a taxonomy of impacts and observations on limits of participatory auditing. No equations, fitted parameters, predictions, or derivations appear in the abstract or described content. Claims rest on direct synthesis of participant articulations rather than any self-referential reduction, self-citation chain, or ansatz smuggled via prior work. The study is self-contained as inductive qualitative evidence; concerns about sample size or realism of manipulations pertain to external validity, not circularity per the analysis rules.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on interpretive synthesis of qualitative workshop data; no free parameters or invented entities are introduced, but the approach assumes participant reflections capture real causal understandings.

axioms (1)
  • domain assumption: Reflexive activities in workshops accurately elicit participants' causal and contextual understandings of ranking impacts
    This assumption enables the synthesis of findings into the taxonomy of impacts.

pith-pipeline@v0.9.0 · 5595 in / 1238 out tokens · 58193 ms · 2026-05-10T15:52:49.673273+00:00 · methodology

