A Sensitivity-Aware Test Collection for Search Among Personal Information
Pith reviewed 2026-06-29 00:34 UTC · model grok-4.3
The pith
A new test collection from the Enron emails supplies queries, relevance judgments, and sensitivity labels to support search systems that avoid exposing private data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We crowdsource 150 query formulations for 50 different topics and 11,471 query-relevance assessments for a subset of the Enron documents that have been manually labelled for sensitivity. We follow best practices for using large language models (LLMs) in Information Retrieval evaluation to extend the collection further with additional LLM judged query-relevance assessments and sensitivity labels. We present baseline performances for relevance, sensitivity classification, and sensitivity-aware search on the collection.
What carries the argument
The sensitivity-aware test collection built from the Enron email corpus, consisting of crowdsourced queries, relevance assessments, and manual sensitivity labels, extended by LLM judgments.
If this is right
- The collection lets researchers measure how well retrieval models can return relevant results while suppressing sensitive documents.
- It provides a shared resource for comparing sensitivity classification methods on real personal correspondence.
- Baseline scores establish reference points against which new sensitivity-aware ranking algorithms can be tested.
- Public release through standard IR toolkits supports reproducible experiments with both sparse and dense retrieval.
Where Pith is reading between the lines
- The same construction approach could be applied to other domains that mix useful records with private data, such as medical or financial archives.
- LLM-assisted label extension may speed up collection building but could embed model-specific biases into future sensitivity-aware systems.
- The resource opens the possibility of measuring the exact relevance cost imposed by adding explicit sensitivity constraints to standard search pipelines.
Load-bearing premise
Crowdsourced relevance and sensitivity labels, together with LLM-generated extensions, accurately capture real user needs and privacy concerns without systematic bias or low inter-annotator agreement.
What would settle it
A follow-up study that finds low agreement between the crowdsourced sensitivity labels and labels assigned by actual former Enron employees or privacy experts would show the collection does not model the target problem.
Figures
read the original abstract
Traditional search tasks aim to satisfy user information needs by returning a subset of a collection of documents, ranked by the documents' relevance to a user query. However, some collections that contain useful information also contain sensitive personal information. Recently, there has been increasing interest in the development of Sensitivity-Aware Search (SAS) retrieval models to provide users with effective retrieval results without revealing such sensitive information. To develop such systems, test collections containing both sensitive and non-sensitive information, a set of queries, and query-document relevance assessments are required. The Enron email corpus contains real business-related emails, where some emails also contain sensitive personal information. However, the original Enron collection does not contain queries or query-relevance assessments. To this end, we crowdsource 150 query formulations for 50 different topics and 11,471 query-relevance assessments for a subset of the Enron documents that have been manually labelled for sensitivity. We follow best practices for using large language models (LLMs) in Information Retrieval evaluation to extend the collection further with additional LLM judged query-relevance assessments and sensitivity labels. We present baseline performances for relevance, sensitivity classification, and sensitivity-aware search on the collection. We make the collection available, including through the popular ir_datasets package, and provide pre-built sparse and dense indices on Huggingface to facilitate easy experimentation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript constructs and releases a sensitivity-aware search (SAS) test collection based on a manually sensitivity-labelled subset of the Enron email corpus. It reports crowdsourcing 150 query formulations across 50 topics together with 11,471 query-relevance assessments, extends the judgments with LLM-generated labels following stated best practices, supplies baseline results for relevance ranking, sensitivity classification, and sensitivity-aware retrieval, and makes the full resource (including sparse/dense indices) available via ir_datasets and Hugging Face.
Significance. If label quality is adequately documented, the collection supplies a needed public resource for evaluating retrieval systems that must trade off relevance against privacy constraints in personal collections. Integration with standard IR tooling and the provision of baselines lower the barrier for follow-on work in sensitivity-aware IR.
major comments (1)
- [Abstract] Abstract: the statement that 'best practices for using large language models (LLMs) in Information Retrieval evaluation' were followed is not accompanied by any reported inter-annotator agreement figures, label-quality statistics, or validation results comparing LLM judgments to human gold labels. These metrics are load-bearing for assessing the reliability of the released collection.
minor comments (1)
- A compact table collating collection statistics (topics, queries, documents, assessment counts, sensitivity distribution) would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive review and recommendation of minor revision. We address the sole major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the statement that 'best practices for using large language models (LLMs) in Information Retrieval evaluation' were followed is not accompanied by any reported inter-annotator agreement figures, label-quality statistics, or validation results comparing LLM judgments to human gold labels. These metrics are load-bearing for assessing the reliability of the released collection.
Authors: We agree that the abstract claim would benefit from explicit reference to supporting metrics. The manuscript body details the LLM prompting strategy (following the cited best-practice guidelines) and reports a human validation subset comparing LLM labels to crowdsourced gold labels, along with label-quality statistics. We will revise the abstract to include a concise parenthetical summary of these results (e.g., agreement rate with human gold labels) so that the reliability claim is directly supported in the abstract itself. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper is a standard resource-construction contribution that describes crowdsourcing 150 queries and 11,471 assessments on a sensitivity-labelled Enron subset, followed by LLM extensions and baseline runs. No equations, fitted parameters, derivations, or predictions appear in the described pipeline; the central claim is the production and release of the collection itself, which is independent of any prior fitted result or self-citation chain. The work follows established IR practices for new test collections without reducing any load-bearing step to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Crowdsourced relevance and sensitivity labels on Enron emails are sufficiently reliable for IR evaluation
- domain assumption LLM judgments can extend the collection while preserving label quality when best practices are followed
Reference graph
Works this paper leans on
-
[1]
Nasreen Abdul-Jaleel, James Allan, W Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D Smucker, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD. InProc. of TREC
2004
-
[2]
Giambattista Amati, Giuseppe Amodeo, Marco Bianchi, Carlo Gaibisso, and Giorgio Gambosi. 2008. FUB, IASI-CNR and University of Tor Vergata at TREC 2008 blog track. InProc. of TREC
2008
-
[3]
Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness.TOIS 20, 4 (2002)
2002
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report.arXiv preprint arXiv:2309.16609(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Jason R Baron, Nathaniel W Rollings, and Douglas W. Oard. 2023. Using ChatGPT for the FOIA Exemption 5 Deliberative Process Privilege. InLegalAIIA @ ICAIL
2023
-
[6]
Karl Branting, Bradford Brown, Chris Giannella, James Van Guilder, Jeff Harrold, Sarah Howell, and Jason R Baron. 2025. Decision support for detecting sensitive text in government records: Anonymous submission.AI Law33, 1 (2025)
2025
-
[7]
Naciye Celebi and Narasimha Shashidhar. 2022. Topic modeling in the enron dataset. InProc. of BigData
2022
-
[8]
Xuejun Chang, Debabrata Mishra, Craig Macdonald, and Sean MacAvaney. 2024. Neural Passage Quality Estimation for Static Pruning. InProc. of SIGIR
2024
-
[9]
Gordon V Cormack, Maura R Grossman, Bruce Hedin, and Douglas W. Oard
-
[10]
Overview of the TREC 2010 legal track. InProc. of TREC
2010
-
[11]
John Dang, Shivalika Singh, Daniel D’souza, Arash Ahmadian, Alejandro Sala- manca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, et al. 2024. Aya expanse: Combining research breakthroughs for a new multilingual frontier.arXiv preprint arXiv:2412.04261(2024)
-
[12]
Department for Science, Innovation and Technology, Department for Business and Trade, Office for Artificial Intelligence, Department for Digital, Culture, Media & Sport, and Department for Business, En- ergy & Industrial Strategy. 2019. Artificial Intelligence Sector Deal. https://www.gov.uk/government/publications/artificial-intelligence-sector-deal (2019)
2019
-
[13]
Laura Dietz, Oleg Zendel, Peter Bailey, Charles LA Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, and Nick Craswell. 2025. Principles and Guidelines for the Use of LLM Judges. InProc. of ICTIR
2025
-
[14]
Guglielmo Faggioli, Laura Dietz, Charles LA Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, et al. 2023. Perspectives on large language models for relevance judgment. InProc. of ICTIR
2023
-
[16]
arXiv preprint arXiv:2109.10086(2021)
SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086(2021)
-
[17]
Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant
-
[18]
Towards effective and efficient sparse neural information retrieval.TOIS 42, 5 (2024)
2024
-
[19]
Timothy Gollins, Graham McDonald, Craig Macdonald, and Iadh Ounis. 2014. On Using Information Retrieval for the Selection and Sensitivity Review of Digital Public Records. InPIR @ SIGIR
2014
-
[20]
Marti A Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. InETMTNLP @ ACL
2005
-
[21]
Modassir Iqbal, Katie Shilton, Mahmoud F Sayed, Douglas Oard, Jonah Lynn Rivera, and William Cox. 2021. Search with discretion: value sensitive design of training data for information retrieval.Proc. of PACMHCI5 (2021)
2021
-
[22]
2022.Archives, access and artificial intelligence: working with born- digital and digitized archival collections
Lise Jaillant. 2022.Archives, access and artificial intelligence: working with born- digital and digitized archival collections. Bielefeld University Press
2022
-
[23]
Lise Jaillant. 2022. How can we make born-digital and digitised archives more accessible? Identifying obstacles and solutions.Arch. Sci.22, 3 (2022)
2022
-
[24]
Lise Jaillant and Annalina Caputo. 2022. Unlocking digital archives: cross- disciplinary perspectives on AI and born-digital data.AI Soc37, 3 (2022)
2022
-
[25]
Lise Jaillant and Arran Rees. 2023. Applying AI to digital archives: trust, collabo- ration and shared professional ethics.DSH38, 2 (2023)
2023
-
[26]
Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques.TOIS20, 4 (2002)
2002
-
[27]
Robertson
Karen Sparck Jones, Steve Walker, and Stephen E. Robertson. 2000. A probabilistic model of information retrieval: development and comparative experiments: Part
2000
-
[28]
Maurice G Kendall. 1938. A new measure of rank correlation.Biometrika30, 1-2 (1938)
1938
-
[29]
Bryan Klimt and Yiming Yang. 2004. Introducing the Enron corpus. InProc. of CEAS
2004
- [30]
-
[31]
Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In RepL4NLP @ ACL
2021
-
[32]
Carolyn E Lipscomb. 2000. Medical subject headings (MeSH).BMLA88, 3 (2000)
2000
-
[33]
Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2024. Fine- tuning llama for multi-stage text retrieval. InProc. of SIGIR
2024
-
[34]
Sean MacAvaney. 2025. Artifact Sharing for Information Retrieval Research. In Proc. of SIGIR
2025
-
[35]
Sean MacAvaney, Andrew Yates, Sergey Feldman, Doug Downey, Arman Cohan, and Nazli Goharian. 2021. Simplified data wrangling with ir_datasets. InProc. of SIGIR
2021
-
[36]
Maarten Marx, Maik Larooij, Filipp Perasedillo, and Jaap Kamps. 2023. Enticing Local Governments to Produce FAIR Freedom of Information Act Dossiers. In Proc. of ECIR
2023
-
[37]
Andrew McCallum, Andrés Corrada-Emmanuel, and Xuerui Wang. 2004. The author-recipient-topic model for topic and role discovery in social networks: Experiments with enron and academic email. InNIPS’04 Workshop on Structured Data and Representations in Probabilistic Models for Categorization
2004
-
[38]
Graham McDonald, Nicolás García-Pedrajas, Craig Macdonald, and Iadh Ounis
-
[39]
A study of SVM kernel functions for sensitivity classification ensembles with POS sequences. InProc. of SIGIR
-
[40]
Graham McDonald, Craig Macdonald, and Iadh Ounis. 2017. Enhancing sensi- tivity classification with semantic features using word embeddings. InProc. of ECIR
2017
-
[41]
Graham McDonald, Craig Macdonald, and Iadh Ounis. 2018. Towards maximising openness in digital sensitivity review using reviewing time predictions. InProc. of ECIR
2018
-
[42]
Graham McDonald, Craig Macdonald, and Iadh Ounis. 2020. How the accuracy and confidence of sensitivity classification affects digital sensitivity review.TOIS 39, 1 (2020)
2020
-
[43]
Graham McDonald, Craig Macdonald, Iadh Ounis, and Timothy Gollins. 2014. Towards a classifier for digital sensitivity review. InProc. of ECIR
2014
-
[44]
Jack McKechnie. 2024. Cascading Ranking Pipelines for Sensitivity-Aware Search. InProc. of ECIR
2024
-
[45]
Jack McKechnie, Graham McDonald, and Craig Macdonald. 2024. Bi-Objective Negative Sampling for Sensitivity-Aware Search. InProc. of SIGIR
2024
-
[46]
Jack McKechnie, Graham McDonald, and Craig Macdonald. 2025. Context Exam- ple Selection for LLM Generated Relevance Assessments. InProc. of ECIR
2025
- [47]
-
[48]
Michael S Moss and Tim J Gollins. 2017. Our digital legacy: an archival perspective. JCAS4, 2 (2017)
2017
- [49]
-
[50]
Oard, Fabrizio Sebastiani, and Jyothi K Vinjumur
Douglas W. Oard, Fabrizio Sebastiani, and Jyothi K Vinjumur. 2018. Jointly minimizing the expected costs of review for responsiveness and privilege in e-discovery.TOIS37, 1 (2018)
2018
-
[51]
Oard, Katie Shilton, and Jimmy Lin
Douglas W. Oard, Katie Shilton, and Jimmy Lin. 2016. Evaluating Search Among Secrets. InEVIA @ NTCIR
2016
-
[52]
Ronak Pradeep, Yuqi Liu, Xinyu Zhang, Yilin Li, Andrew Yates, and Jimmy Lin
-
[53]
Squeezing water from a stone: a bag of tricks for further improving cross- encoder effectiveness for reranking. InProc. of ECIR
- [54]
-
[55]
Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. 2000. Inference of population structure using multilocus genotype data.Genetics155, 2 (2000)
2000
-
[56]
Ronald Rivest. 1992. RFC1321: The MD5 message-digest algorithm
1992
-
[57]
Oard, and Katie Shilton
Mahmoud F Sayed, William Cox, Jonah Lynn Rivera, Caitlin Christian-Lamb, Modassir Iqbal, Douglas W. Oard, and Katie Shilton. 2020. A test collection for relevance and sensitivity. InProc. of SIGIR
2020
-
[58]
Mahmoud F Sayed and Douglas W Oard. 2019. Jointly modeling relevance and sensitivity for search among sensitive content. InProc. of SIGIR
2019
-
[59]
Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval.J. Doc28, 1 (1972)
1972
-
[60]
Karen Sparck Jones and Cornelis Joost Van Rijsbergen. 1976. Information retrieval test collections.J. Doc.32, 1 (1976)
1976
-
[61]
C Spearman. 1904. The Proof and Measurement of Association between Two Things.AJP15 (1904)
1904
-
[62]
C William Thomas. 2002. The rise and fall of Enron.J. Account193, 4 (2002)
2002
-
[63]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [64]
-
[65]
Jyothi K Vinjumur and Douglas W. Oard. 2015. Finding the privileged few: Supporting privilege review for e-discovery.Proc. of ASIS&T52, 1 (2015)
2015
-
[66]
Ellen M Voorhees. 1998. Variations in relevance judgments and the measurement of retrieval effectiveness. InProc. of SIGIR. A Sensitivity-Aware Test Collection for Search Among Personal Information SIGIR ’26, July 20–24, 2026, Melbourne, VIC, Australia
1998
-
[67]
Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533(2022)
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[68]
Shuai Wang and Guido Zuccon. 2023. Balanced topic aware sampling for effective dense retriever: A reproducibility study. InProc. of SIGIR
2023
-
[69]
Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou
-
[70]
Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers.NeurIPS33 (2020)
2020
-
[71]
Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2022. RetroMAE: Pre- training retrieval-oriented language models via masked auto-encoder. InProc. of EMNLP
2022
-
[72]
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. InProc. of ICLR
2021
-
[73]
Emine Yilmaz, Javed A Aslam, and Stephen Robertson. 2008. A new rank correla- tion coefficient for information retrieval. InProc. of SIGIR
2008
-
[74]
Justine Zhang, James Pennebaker, Susan Dumais, and Eric Horvitz. 2020. Config- uring audiences: A case study of email communication.PACMHCI4, CSCW1 (2020)
2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.