pith. sign in

arxiv: 2606.27559 · v1 · pith:4R4GGXSKnew · submitted 2026-06-25 · 💻 cs.IR

A Sensitivity-Aware Test Collection for Search Among Personal Information

Pith reviewed 2026-06-29 00:34 UTC · model grok-4.3

classification 💻 cs.IR
keywords sensitivity-aware searchtest collectionEnron corpusrelevance assessmentprivacy in information retrievalcrowdsourcingLLM judgmentspersonal information search
0
0 comments X

The pith

A new test collection from the Enron emails supplies queries, relevance judgments, and sensitivity labels to support search systems that avoid exposing private data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark resource for sensitivity-aware search using the Enron email corpus, which mixes useful business content with sensitive personal details. Researchers crowdsource 150 query formulations across 50 topics and collect 11,471 query-relevance assessments on documents already labeled for sensitivity. They extend the labels and judgments with large language models following established evaluation practices. Baseline results are reported for relevance ranking, sensitivity classification, and combined sensitivity-aware retrieval. The full collection, including indices, is released publicly to allow others to test and improve such systems.

Core claim

We crowdsource 150 query formulations for 50 different topics and 11,471 query-relevance assessments for a subset of the Enron documents that have been manually labelled for sensitivity. We follow best practices for using large language models (LLMs) in Information Retrieval evaluation to extend the collection further with additional LLM judged query-relevance assessments and sensitivity labels. We present baseline performances for relevance, sensitivity classification, and sensitivity-aware search on the collection.

What carries the argument

The sensitivity-aware test collection built from the Enron email corpus, consisting of crowdsourced queries, relevance assessments, and manual sensitivity labels, extended by LLM judgments.

If this is right

  • The collection lets researchers measure how well retrieval models can return relevant results while suppressing sensitive documents.
  • It provides a shared resource for comparing sensitivity classification methods on real personal correspondence.
  • Baseline scores establish reference points against which new sensitivity-aware ranking algorithms can be tested.
  • Public release through standard IR toolkits supports reproducible experiments with both sparse and dense retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same construction approach could be applied to other domains that mix useful records with private data, such as medical or financial archives.
  • LLM-assisted label extension may speed up collection building but could embed model-specific biases into future sensitivity-aware systems.
  • The resource opens the possibility of measuring the exact relevance cost imposed by adding explicit sensitivity constraints to standard search pipelines.

Load-bearing premise

Crowdsourced relevance and sensitivity labels, together with LLM-generated extensions, accurately capture real user needs and privacy concerns without systematic bias or low inter-annotator agreement.

What would settle it

A follow-up study that finds low agreement between the crowdsourced sensitivity labels and labels assigned by actual former Enron employees or privacy experts would show the collection does not model the target problem.

Figures

Figures reproduced from arXiv: 2606.27559 by Craig Macdonald, Graham McDonald, Jack McKechnie.

Figure 1
Figure 1. Figure 1: Example information need and queries. be relevant to any information need. For example, some emails are lists of prices or are one-word responses. As such, we deploy the QualT5 [8] model over the collection. QualT5 is typically used for index pruning to remove documents that are unlikely to be relevant to any topic. We use QualT5 to remove the 0.1% lowest quality docu￾ments. We select this threshold via qu… view at source ↗
Figure 2
Figure 2. Figure 2: Examples of sensitive documents from SARA. Shown are a non-sensitive document (top), a purely personal document (middle) and a document that is personal but in a professional context (bottom). employment and has some personal elements to it. Through these three example documents from the SARA collection, we can see the different types of documents present in the collection, and what makes them likely to be… view at source ↗
Figure 3
Figure 3. Figure 3: An example of accessing the SARA collection via ir_datasets and performing a PyTerrier experiment. information. As such, it must be possible to effectively identify the sensitive information that is included in a SAS test collection, with automated methods. Hence, we demonstrate baseline clas￾sification performance on the SARA collection. We train logistic regression (LR), support vector machine (SVM) and … view at source ↗
read the original abstract

Traditional search tasks aim to satisfy user information needs by returning a subset of a collection of documents, ranked by the documents' relevance to a user query. However, some collections that contain useful information also contain sensitive personal information. Recently, there has been increasing interest in the development of Sensitivity-Aware Search (SAS) retrieval models to provide users with effective retrieval results without revealing such sensitive information. To develop such systems, test collections containing both sensitive and non-sensitive information, a set of queries, and query-document relevance assessments are required. The Enron email corpus contains real business-related emails, where some emails also contain sensitive personal information. However, the original Enron collection does not contain queries or query-relevance assessments. To this end, we crowdsource 150 query formulations for 50 different topics and 11,471 query-relevance assessments for a subset of the Enron documents that have been manually labelled for sensitivity. We follow best practices for using large language models (LLMs) in Information Retrieval evaluation to extend the collection further with additional LLM judged query-relevance assessments and sensitivity labels. We present baseline performances for relevance, sensitivity classification, and sensitivity-aware search on the collection. We make the collection available, including through the popular ir_datasets package, and provide pre-built sparse and dense indices on Huggingface to facilitate easy experimentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript constructs and releases a sensitivity-aware search (SAS) test collection based on a manually sensitivity-labelled subset of the Enron email corpus. It reports crowdsourcing 150 query formulations across 50 topics together with 11,471 query-relevance assessments, extends the judgments with LLM-generated labels following stated best practices, supplies baseline results for relevance ranking, sensitivity classification, and sensitivity-aware retrieval, and makes the full resource (including sparse/dense indices) available via ir_datasets and Hugging Face.

Significance. If label quality is adequately documented, the collection supplies a needed public resource for evaluating retrieval systems that must trade off relevance against privacy constraints in personal collections. Integration with standard IR tooling and the provision of baselines lower the barrier for follow-on work in sensitivity-aware IR.

major comments (1)
  1. [Abstract] Abstract: the statement that 'best practices for using large language models (LLMs) in Information Retrieval evaluation' were followed is not accompanied by any reported inter-annotator agreement figures, label-quality statistics, or validation results comparing LLM judgments to human gold labels. These metrics are load-bearing for assessing the reliability of the released collection.
minor comments (1)
  1. A compact table collating collection statistics (topics, queries, documents, assessment counts, sensitivity distribution) would improve readability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. We address the sole major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the statement that 'best practices for using large language models (LLMs) in Information Retrieval evaluation' were followed is not accompanied by any reported inter-annotator agreement figures, label-quality statistics, or validation results comparing LLM judgments to human gold labels. These metrics are load-bearing for assessing the reliability of the released collection.

    Authors: We agree that the abstract claim would benefit from explicit reference to supporting metrics. The manuscript body details the LLM prompting strategy (following the cited best-practice guidelines) and reports a human validation subset comparing LLM labels to crowdsourced gold labels, along with label-quality statistics. We will revise the abstract to include a concise parenthetical summary of these results (e.g., agreement rate with human gold labels) so that the reliability claim is directly supported in the abstract itself. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper is a standard resource-construction contribution that describes crowdsourcing 150 queries and 11,471 assessments on a sensitivity-labelled Enron subset, followed by LLM extensions and baseline runs. No equations, fitted parameters, derivations, or predictions appear in the described pipeline; the central claim is the production and release of the collection itself, which is independent of any prior fitted result or self-citation chain. The work follows established IR practices for new test collections without reducing any load-bearing step to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central contribution rests on two domain assumptions: that crowdsourced annotations yield reliable relevance and sensitivity labels, and that LLM judgments can be added safely when best-practice guidelines are followed. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Crowdsourced relevance and sensitivity labels on Enron emails are sufficiently reliable for IR evaluation
    Invoked when the paper states that 11,471 assessments were collected and used as ground truth.
  • domain assumption LLM judgments can extend the collection while preserving label quality when best practices are followed
    Invoked when the paper describes using LLMs to add further query-relevance and sensitivity labels.

pith-pipeline@v0.9.1-grok · 5768 in / 1340 out tokens · 26611 ms · 2026-06-29T00:34:12.598590+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 10 canonical work pages · 3 internal anchors

  1. [1]

    Nasreen Abdul-Jaleel, James Allan, W Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Mark D Smucker, and Courtney Wade. 2004. UMass at TREC 2004: Novelty and HARD. InProc. of TREC

  2. [2]

    Giambattista Amati, Giuseppe Amodeo, Marco Bianchi, Carlo Gaibisso, and Giorgio Gambosi. 2008. FUB, IASI-CNR and University of Tor Vergata at TREC 2008 blog track. InProc. of TREC

  3. [3]

    Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness.TOIS 20, 4 (2002)

  4. [4]

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report.arXiv preprint arXiv:2309.16609(2023)

  5. [5]

    Jason R Baron, Nathaniel W Rollings, and Douglas W. Oard. 2023. Using ChatGPT for the FOIA Exemption 5 Deliberative Process Privilege. InLegalAIIA @ ICAIL

  6. [6]

    Karl Branting, Bradford Brown, Chris Giannella, James Van Guilder, Jeff Harrold, Sarah Howell, and Jason R Baron. 2025. Decision support for detecting sensitive text in government records: Anonymous submission.AI Law33, 1 (2025)

  7. [7]

    Naciye Celebi and Narasimha Shashidhar. 2022. Topic modeling in the enron dataset. InProc. of BigData

  8. [8]

    Xuejun Chang, Debabrata Mishra, Craig Macdonald, and Sean MacAvaney. 2024. Neural Passage Quality Estimation for Static Pruning. InProc. of SIGIR

  9. [9]

    Gordon V Cormack, Maura R Grossman, Bruce Hedin, and Douglas W. Oard

  10. [10]

    Overview of the TREC 2010 legal track. InProc. of TREC

  11. [11]

    John Dang, Shivalika Singh, Daniel D’souza, Arash Ahmadian, Alejandro Sala- manca, Madeline Smith, Aidan Peppin, Sungjin Hong, Manoj Govindassamy, Terrence Zhao, et al. 2024. Aya expanse: Combining research breakthroughs for a new multilingual frontier.arXiv preprint arXiv:2412.04261(2024)

  12. [12]

    Department for Science, Innovation and Technology, Department for Business and Trade, Office for Artificial Intelligence, Department for Digital, Culture, Media & Sport, and Department for Business, En- ergy & Industrial Strategy. 2019. Artificial Intelligence Sector Deal. https://www.gov.uk/government/publications/artificial-intelligence-sector-deal (2019)

  13. [13]

    Laura Dietz, Oleg Zendel, Peter Bailey, Charles LA Clarke, Ellese Cotterill, Jeff Dalton, Faegheh Hasibi, Mark Sanderson, and Nick Craswell. 2025. Principles and Guidelines for the Use of LLM Judges. InProc. of ICTIR

  14. [14]

    Guglielmo Faggioli, Laura Dietz, Charles LA Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, et al. 2023. Perspectives on large language models for relevance judgment. InProc. of ICTIR

  15. [16]

    arXiv preprint arXiv:2109.10086(2021)

    SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv preprint arXiv:2109.10086(2021)

  16. [17]

    Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant

  17. [18]

    Towards effective and efficient sparse neural information retrieval.TOIS 42, 5 (2024)

  18. [19]

    Timothy Gollins, Graham McDonald, Craig Macdonald, and Iadh Ounis. 2014. On Using Information Retrieval for the Selection and Sensitivity Review of Digital Public Records. InPIR @ SIGIR

  19. [20]

    Marti A Hearst. 2005. Teaching applied natural language processing: Triumphs and tribulations. InETMTNLP @ ACL

  20. [21]

    Modassir Iqbal, Katie Shilton, Mahmoud F Sayed, Douglas Oard, Jonah Lynn Rivera, and William Cox. 2021. Search with discretion: value sensitive design of training data for information retrieval.Proc. of PACMHCI5 (2021)

  21. [22]

    2022.Archives, access and artificial intelligence: working with born- digital and digitized archival collections

    Lise Jaillant. 2022.Archives, access and artificial intelligence: working with born- digital and digitized archival collections. Bielefeld University Press

  22. [23]

    Lise Jaillant. 2022. How can we make born-digital and digitised archives more accessible? Identifying obstacles and solutions.Arch. Sci.22, 3 (2022)

  23. [24]

    Lise Jaillant and Annalina Caputo. 2022. Unlocking digital archives: cross- disciplinary perspectives on AI and born-digital data.AI Soc37, 3 (2022)

  24. [25]

    Lise Jaillant and Arran Rees. 2023. Applying AI to digital archives: trust, collabo- ration and shared professional ethics.DSH38, 2 (2023)

  25. [26]

    Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques.TOIS20, 4 (2002)

  26. [27]

    Robertson

    Karen Sparck Jones, Steve Walker, and Stephen E. Robertson. 2000. A probabilistic model of information retrieval: development and comparative experiments: Part

  27. [28]

    Maurice G Kendall. 1938. A new measure of rank correlation.Biometrika30, 1-2 (1938)

  28. [29]

    Bryan Klimt and Yiming Yang. 2004. Introducing the Enron corpus. InProc. of CEAS

  29. [30]

    Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New baselines for SPLADE.arXiv preprint arXiv:2403.06789(2024)

  30. [31]

    Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021. In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In RepL4NLP @ ACL

  31. [32]

    Carolyn E Lipscomb. 2000. Medical subject headings (MeSH).BMLA88, 3 (2000)

  32. [33]

    Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. 2024. Fine- tuning llama for multi-stage text retrieval. InProc. of SIGIR

  33. [34]

    Sean MacAvaney. 2025. Artifact Sharing for Information Retrieval Research. In Proc. of SIGIR

  34. [35]

    Sean MacAvaney, Andrew Yates, Sergey Feldman, Doug Downey, Arman Cohan, and Nazli Goharian. 2021. Simplified data wrangling with ir_datasets. InProc. of SIGIR

  35. [36]

    Maarten Marx, Maik Larooij, Filipp Perasedillo, and Jaap Kamps. 2023. Enticing Local Governments to Produce FAIR Freedom of Information Act Dossiers. In Proc. of ECIR

  36. [37]

    Andrew McCallum, Andrés Corrada-Emmanuel, and Xuerui Wang. 2004. The author-recipient-topic model for topic and role discovery in social networks: Experiments with enron and academic email. InNIPS’04 Workshop on Structured Data and Representations in Probabilistic Models for Categorization

  37. [38]

    Graham McDonald, Nicolás García-Pedrajas, Craig Macdonald, and Iadh Ounis

  38. [39]

    A study of SVM kernel functions for sensitivity classification ensembles with POS sequences. InProc. of SIGIR

  39. [40]

    Graham McDonald, Craig Macdonald, and Iadh Ounis. 2017. Enhancing sensi- tivity classification with semantic features using word embeddings. InProc. of ECIR

  40. [41]

    Graham McDonald, Craig Macdonald, and Iadh Ounis. 2018. Towards maximising openness in digital sensitivity review using reviewing time predictions. InProc. of ECIR

  41. [42]

    Graham McDonald, Craig Macdonald, and Iadh Ounis. 2020. How the accuracy and confidence of sensitivity classification affects digital sensitivity review.TOIS 39, 1 (2020)

  42. [43]

    Graham McDonald, Craig Macdonald, Iadh Ounis, and Timothy Gollins. 2014. Towards a classifier for digital sensitivity review. InProc. of ECIR

  43. [44]

    Jack McKechnie. 2024. Cascading Ranking Pipelines for Sensitivity-Aware Search. InProc. of ECIR

  44. [45]

    Jack McKechnie, Graham McDonald, and Craig Macdonald. 2024. Bi-Objective Negative Sampling for Sensitivity-Aware Search. InProc. of SIGIR

  45. [46]

    Jack McKechnie, Graham McDonald, and Craig Macdonald. 2025. Context Exam- ple Selection for LLM Generated Relevance Assessments. InProc. of ECIR

  46. [47]

    Chuan Meng, Jiqun Liu, Mohammad Aliannejadi, Fengran Mo, Jeff Dalton, and Maarten de Rijke. 2026. Re-Rankers as Relevance Judges.arXiv preprint arXiv:2601.04455(2026)

  47. [48]

    Michael S Moss and Tim J Gollins. 2017. Our digital legacy: an archival perspective. JCAS4, 2 (2017)

  48. [49]

    Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT.arXiv preprint arXiv:1910.14424(2019)

  49. [50]

    Oard, Fabrizio Sebastiani, and Jyothi K Vinjumur

    Douglas W. Oard, Fabrizio Sebastiani, and Jyothi K Vinjumur. 2018. Jointly minimizing the expected costs of review for responsiveness and privilege in e-discovery.TOIS37, 1 (2018)

  50. [51]

    Oard, Katie Shilton, and Jimmy Lin

    Douglas W. Oard, Katie Shilton, and Jimmy Lin. 2016. Evaluating Search Among Secrets. InEVIA @ NTCIR

  51. [52]

    Ronak Pradeep, Yuqi Liu, Xinyu Zhang, Yilin Li, Andrew Yates, and Jimmy Lin

  52. [53]

    Squeezing water from a stone: a bag of tricks for further improving cross- encoder effectiveness for reranking. InProc. of ECIR

  53. [54]

    Ronak Pradeep, Rodrigo Nogueira, and Jimmy Lin. 2021. The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models. arXiv preprint arXiv:2101.05667(2021)

  54. [55]

    Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. 2000. Inference of population structure using multilocus genotype data.Genetics155, 2 (2000)

  55. [56]

    Ronald Rivest. 1992. RFC1321: The MD5 message-digest algorithm

  56. [57]

    Oard, and Katie Shilton

    Mahmoud F Sayed, William Cox, Jonah Lynn Rivera, Caitlin Christian-Lamb, Modassir Iqbal, Douglas W. Oard, and Katie Shilton. 2020. A test collection for relevance and sensitivity. InProc. of SIGIR

  57. [58]

    Mahmoud F Sayed and Douglas W Oard. 2019. Jointly modeling relevance and sensitivity for search among sensitive content. InProc. of SIGIR

  58. [59]

    Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval.J. Doc28, 1 (1972)

  59. [60]

    Karen Sparck Jones and Cornelis Joost Van Rijsbergen. 1976. Information retrieval test collections.J. Doc.32, 1 (1976)

  60. [61]

    C Spearman. 1904. The Proof and Measurement of Association between Two Things.AJP15 (1904)

  61. [62]

    C William Thomas. 2002. The rise and fall of Enron.J. Account193, 4 (2002)

  62. [63]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971(2023)

  63. [64]

    Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, and Jimmy Lin. 2024. Umbrela: Umbrela is the (open-source reproduction of the) bing relevance assessor.arXiv preprint arXiv:2406.06519(2024)

  64. [65]

    Jyothi K Vinjumur and Douglas W. Oard. 2015. Finding the privileged few: Supporting privilege review for e-discovery.Proc. of ASIS&T52, 1 (2015)

  65. [66]

    Ellen M Voorhees. 1998. Variations in relevance judgments and the measurement of retrieval effectiveness. InProc. of SIGIR. A Sensitivity-Aware Test Collection for Search Among Personal Information SIGIR ’26, July 20–24, 2026, Melbourne, VIC, Australia

  66. [67]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022. Text embeddings by weakly-supervised contrastive pre-training.arXiv preprint arXiv:2212.03533(2022)

  67. [68]

    Shuai Wang and Guido Zuccon. 2023. Balanced topic aware sampling for effective dense retriever: A reproducibility study. InProc. of SIGIR

  68. [69]

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou

  69. [70]

    Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers.NeurIPS33 (2020)

  70. [71]

    Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2022. RetroMAE: Pre- training retrieval-oriented language models via masked auto-encoder. InProc. of EMNLP

  71. [72]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. Approximate nearest neighbor negative contrastive learning for dense text retrieval. InProc. of ICLR

  72. [73]

    Emine Yilmaz, Javed A Aslam, and Stephen Robertson. 2008. A new rank correla- tion coefficient for information retrieval. InProc. of SIGIR

  73. [74]

    Justine Zhang, James Pennebaker, Susan Dumais, and Eric Horvitz. 2020. Config- uring audiences: A case study of email communication.PACMHCI4, CSCW1 (2020)