To Redact, or not to Redact? A Local LLM Approach to Deliberative Process Privilege Classification
Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3
The pith
A 9B local model with chain-of-thought and error-based few-shot prompting classifies deliberative process privilege content nearly as well as a commercial model (Gemini 2.5 Flash).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that Chain-of-Thought prompting combined with few-shot prompting using error-based examples on the Qwen3.5 9B model outperforms prior classification models on recall and F2 score for deliberative process privilege and approaches the performance of Gemini 2.5 Flash. Predicted deliberative sentences contain more opinion-expressing verbs and first-person phrasing, and deliberativeness appears most strongly marked by the joint occurrence of these indicators rather than any single cue.
What carries the argument
The Chain-of-Thought plus error-based few-shot prompting combination applied to sentence-level binary classification of deliberative content in the local 9B model.
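As a rough illustration of how such a prompt might be assembled, the sketch below builds a Chain-of-Thought prompt whose few-shot examples are sentences that an earlier run misclassified, shown with their corrected labels. Everything in it is an assumption made for illustration: the names `Example`, `select_error_examples`, and `build_prompt`, the prompt wording, the toy sentences, and the selection rule (take the first k errors). The paper's actual prompts and selection procedure are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Example:
    sentence: str
    gold: int        # 1 = deliberative, 0 = not deliberative
    predicted: int   # label from an earlier (hypothetical) zero-shot run


def select_error_examples(dev_set, k=4):
    """Pick up to k dev-set sentences the earlier run got wrong (assumed rule)."""
    errors = [ex for ex in dev_set if ex.gold != ex.predicted]
    return errors[:k]


def build_prompt(target_sentence, shots):
    """Assemble a CoT prompt: instruction, corrected error examples, target."""
    lines = [
        "Decide whether the sentence is covered by the deliberative process privilege.",
        "Reason step by step, then finish with 'Label: 1' or 'Label: 0'.",
        "",
    ]
    for ex in shots:
        # Show the gold label, not the earlier wrong prediction.
        lines += [f"Sentence: {ex.sentence}", f"Label: {ex.gold}", ""]
    lines += [f"Sentence: {target_sentence}", "Reasoning:"]
    return "\n".join(lines)


# Toy development set (invented sentences and labels).
dev = [
    Example("I think we should delay the announcement.", gold=1, predicted=0),
    Example("The meeting is scheduled for 3 pm.", gold=0, predicted=0),
    Example("Attached is the final signed report.", gold=0, predicted=1),
]
shots = select_error_examples(dev, k=2)
prompt = build_prompt("We recommend revising the draft policy.", shots)
print(prompt)
```

The resulting string would then be sent to whichever local inference runtime hosts the model; that step is omitted because the source does not say which runtime was used.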
If this is right
- Transparency offices could run the classifier locally to decide redactions without transmitting unreviewed documents to third-party services.
- The higher recall reduces the chance that truly deliberative material is released by mistake.
- The identified linguistic patterns offer a starting point for lighter-weight or hybrid detection methods.
- Error-based example selection provides a repeatable way to improve prompting for domain-specific classification tasks.
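The recall and F2 figures discussed above follow the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 2 weights recall more heavily than precision. A minimal sketch, using invented confusion counts rather than any numbers from the paper, shows why F2 rewards a classifier that misses fewer deliberative sentences even when F1 is unchanged:

```python
def f_beta(tp, fp, fn, beta=2.0):
    """Standard F-beta score computed from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Two hypothetical classifiers with the same number of errors overall:
# one leans toward false positives, the other toward false negatives.
high_recall = dict(tp=80, fp=40, fn=20)
high_precision = dict(tp=80, fp=20, fn=40)

for name, counts in [("high recall", high_recall), ("high precision", high_precision)]:
    f1 = f_beta(**counts, beta=1.0)
    f2 = f_beta(**counts, beta=2.0)
    print(f"{name}: F1={f1:.3f} F2={f2:.3f}")
```

Both toy classifiers have the same F1, but the recall-leaning one has the higher F2, which matches the intuition that releasing privileged material by mistake is the costlier error in redaction review.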
Where Pith is reading between the lines
- The same prompting recipe could be tested on other FOIA exemptions that also hinge on intent or context.
- Consumer-grade deployment lowers the barrier for smaller agencies or public oversight groups to perform their own classification.
- The combination of first-person and opinion verbs might be captured by simpler keyword or syntactic rules in some document types.
- Measuring how performance changes across document formats or government agencies would show where retraining or prompt tuning is required.
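The keyword-rule idea in the list above could look something like the following sketch, which flags a sentence only when a first-person word and an opinion verb occur together. The two word lists are small hand-picked assumptions, not the lexicons used in the paper, and the matching is deliberately naive (no lemmatization, so inflected forms such as "recommended" are missed).

```python
import re

# Illustrative word lists (assumptions, not taken from the paper).
FIRST_PERSON = {"i", "we", "my", "our", "me", "us"}
OPINION_VERBS = {"think", "believe", "recommend", "suggest", "feel", "propose"}


def cue_flags(sentence):
    """Return (has_first_person, has_opinion_verb) for a sentence."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    return bool(tokens & FIRST_PERSON), bool(tokens & OPINION_VERBS)


def looks_deliberative(sentence):
    """Flag a sentence only when both cues occur together."""
    has_first_person, has_opinion_verb = cue_flags(sentence)
    return has_first_person and has_opinion_verb


for s in [
    "I think we should delay the announcement.",
    "We received the report on Monday.",
    "Analysts believe the market will recover.",
]:
    print(looks_deliberative(s), "-", s)
```

Requiring both cues mirrors the reported observation that the joint occurrence is more telling than either cue alone; a rule like this would at most serve as a cheap pre-filter, not a replacement for the classifier.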
Load-bearing premise
The performance measured on the evaluated dataset carries over to real FOIA documents, and running the model on consumer hardware does not produce meaningful drops in accuracy or speed.
What would settle it
Running the same prompting setup on a fresh collection of actual FOIA-released documents whose redaction decisions are already known and checking whether recall and F2 remain at the reported levels.
Original abstract
Government transparency laws, like the Freedom of Information (FOIA) acts in the United States and United Kingdom, and the Woo (Open Government Act) in the Netherlands, grant citizens the right to directly request documents from the government. As these documents might contain sensitive information, such as personal information or threats to national security, the laws allow governments to redact sensitive parts of the documents prior to release. We build on prior research to perform automatic sensitivity classification for the FOIA Exemption 5 deliberative process privilege using Large Language Models (LLMs). However, processing documents not yet cleared for review via third-party cloud APIs is often legally or politically untenable. Therefore, in this work, we perform sensitivity classification with a small, local model, deployable on consumer-grade hardware (Qwen3.5 9B). We compare eight variants of applying LLMs for sentence classification, using well-known prompting techniques, and find that a combination of Chain-of-Thought prompting and few-shot prompting with error-based examples outperforms classification models of earlier work in terms of recall and F2 score. This method also closely approaches the performance of a widely-used, cost-efficient commercial model (Gemini 2.5 Flash). In an additional analysis, we find that sentences that are predicted as deliberative contain more verbs that indicate the expression of opinions, and are more often phrased in the first person. Above all, deliberativeness seems characterized by the presence of a combination of multiple indicators, in particular the combination of first-person words with a verb for expressing opinion.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the use of a local 9B-parameter LLM (Qwen3.5) for sentence-level classification of deliberative-process-privilege content under FOIA Exemption 5. Eight prompting variants are compared; the authors report that Chain-of-Thought combined with few-shot examples selected on the basis of prior errors yields higher recall and F2 than earlier non-LLM classifiers while approaching the performance of Gemini 2.5 Flash. A secondary linguistic analysis finds that predicted deliberative sentences contain more opinion-expressing verbs and first-person markers, often in combination.
Significance. If the empirical comparisons are reproducible and the test distribution is representative, the work demonstrates that small, locally deployable models can perform competitively on a legally sensitive classification task without cloud APIs. This has direct implications for privacy-preserving FOIA processing pipelines. The linguistic feature analysis supplies a modest interpretability contribution. The absence of dataset statistics, statistical tests, and cross-domain validation in the abstract, however, prevents a firm judgment on whether the reported gains are robust or dataset-specific.
major comments (3)
- [Abstract] Abstract: the claim of outperformance in recall and F2 is presented without any dataset size, class balance, number of documents, or statistical significance tests for the comparisons against prior classifiers. This information is load-bearing for the central empirical claim.
- [Evaluation] Evaluation protocol (inferred from abstract and results description): no cross-agency, cross-topic, or temporal hold-out validation is described. Given that real FOIA corpora vary substantially by agency, document length, and redaction style, the reported gains may not generalize beyond the (unspecified) test set.
- [Results] Results section: the abstract states that the best prompting variant 'closely approaches' Gemini 2.5 Flash, yet supplies neither the exact F2/recall deltas nor confidence intervals, making it impossible to judge whether the local model is practically interchangeable.
minor comments (1)
- [Abstract] The abstract refers to 'earlier work' without citing the specific prior classification models or their reported metrics, which would aid readers in assessing the magnitude of improvement.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to the manuscript.
- Referee: [Abstract] Abstract: the claim of outperformance in recall and F2 is presented without any dataset size, class balance, number of documents, or statistical significance tests for the comparisons against prior classifiers. This information is load-bearing for the central empirical claim.
  Authors: We agree that the abstract would be strengthened by including these details. In the revised manuscript we will expand the abstract to report the total number of documents and sentences in the dataset, the class distribution, and a brief statement that the reported gains in recall and F2 over prior classifiers are statistically significant according to the tests already presented in the results section.
  Revision: yes
- Referee: [Evaluation] Evaluation protocol (inferred from abstract and results description): no cross-agency, cross-topic, or temporal hold-out validation is described. Given that real FOIA corpora vary substantially by agency, document length, and redaction style, the reported gains may not generalize beyond the (unspecified) test set.
  Authors: The evaluation protocol, including the construction and split of the test set, is described in the Methods section of the full manuscript. We did not conduct explicit cross-agency, cross-topic, or temporal hold-out experiments in the present study. We will add a limitations subsection that explicitly discusses this scope and the potential for domain shift across FOIA corpora, while preserving the current experimental design.
  Revision: partial
- Referee: [Results] Results section: the abstract states that the best prompting variant 'closely approaches' Gemini 2.5 Flash, yet supplies neither the exact F2/recall deltas nor confidence intervals, making it impossible to judge whether the local model is practically interchangeable.
  Authors: We will revise the abstract to include the exact recall and F2 scores for the best local prompting variant and for Gemini 2.5 Flash, together with any confidence intervals or standard deviations obtained from the repeated runs reported in the results. This will allow readers to assess the practical difference in performance directly.
  Revision: yes
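One way a confidence interval for the F2 gap between two models could be obtained is a paired percentile bootstrap over test sentences, sketched below. This is an assumed procedure for illustration only: the source does not state which intervals or tests the authors use, and the labels and predictions here are invented toy data.

```python
import random


def f2(gold, pred):
    """F2 from binary labels, via the count form 5*TP / (5*TP + 4*FN + FP)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0


def bootstrap_f2_diff(gold, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for F2(a) - F2(b), resampling sentences in pairs."""
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_resamples):
        # Draw the same sentence indices for both systems (paired resampling).
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f2(g, a) - f2(g, b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: 40 sentences, two hypothetical systems.
gold = [1, 0, 1, 1, 0, 0, 1, 0] * 5
local_model = [1, 0, 1, 0, 0, 1, 1, 0] * 5
cloud_model = [1, 0, 1, 1, 0, 1, 1, 0] * 5

print(bootstrap_f2_diff(gold, local_model, cloud_model))
```

If the resulting interval comfortably contains zero, the data do not distinguish the two systems on F2; an interval entirely below zero would indicate the local model trails the commercial one on that test set.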
Circularity Check
No circularity: purely empirical prompting comparison
The paper reports an empirical evaluation of eight LLM prompting variants (including CoT and error-based few-shot) on sentence-level classification for FOIA deliberative process privilege. Performance is measured via recall and F2 against prior classification models and Gemini 2.5 Flash, with an auxiliary linguistic analysis of predicted sentences. No equations, fitted parameters, self-definitional constructs, or derivation chains exist; results rest on external model outputs and dataset metrics rather than internal definitions or self-citations that reduce the central claim to its inputs. The work is self-contained as a prompting experiment.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can perform nuanced sentence-level classification for legal privilege detection using only prompting techniques, without fine-tuning