To Redact, or not to Redact? A Local LLM Approach to Deliberative Process Privilege Classification

David Graus; Maik Larooij

arXiv: 2605.10211 · v1 · submitted 2026-05-11 · cs.CL · cs.AI · cs.IR


Pith reviewed 2026-05-12 04:30 UTC · model grok-4.3

classification: cs.CL, cs.AI, cs.IR

keywords: deliberative process privilege, FOIA, local LLM, chain-of-thought prompting, few-shot prompting, sentence classification, sensitivity classification, government transparency


The pith

A 9B local model with chain-of-thought and error-based few-shot prompting classifies deliberative process privilege nearly as well as commercial models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether small language models that run on ordinary computers can automatically flag sentences in government documents that qualify for the deliberative process privilege exemption under laws such as FOIA. Eight prompting strategies are compared on a 9 billion parameter local model, and the combination of chain-of-thought reasoning plus few-shot examples drawn from prior mistakes produces higher recall and F2 scores than earlier classification systems while coming close to the results of a commercial model. The same sentences also show more first-person language and verbs that express opinions, with the joint presence of both features proving especially diagnostic. This setup avoids sending unreleased documents to external services, which matters because such transfers raise legal and political barriers for transparency offices.

Core claim

We show that Chain-of-Thought prompting combined with few-shot prompting using error-based examples on the Qwen3.5 9B model outperforms prior classification models on recall and F2 score for deliberative process privilege and approaches the performance of Gemini 2.5 Flash. Predicted deliberative sentences contain more opinion-expressing verbs and first-person phrasing, and deliberativeness appears most strongly marked by the joint occurrence of these indicators rather than any single cue.
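
For reference, the F2 score named in the claim is the standard F-beta measure with β = 2, which weights recall more heavily than precision. The definitions below are the textbook ones and are not quoted from the paper; TP, FP and FN are assumed to count true positives, false positives and false negatives for the deliberative class.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}
\]
\[
F_2 = \frac{5\,P\,R}{4P + R} = \frac{5\,TP}{5\,TP + 4\,FN + FP}
\]
```

As an illustration, P = 0.6 and R = 0.8 give F2 = 0.75, compared with F1 ≈ 0.686, which shows how the measure rewards a recall-heavy classifier.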

What carries the argument

The Chain-of-Thought plus error-based few-shot prompting combination applied to sentence-level binary classification of deliberative content in the local 9B model.
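
A minimal sketch of how such a prompt might be assembled, assuming that "error-based" means the few-shot examples are sentences an earlier run misclassified. The paper's actual prompt wording and selection procedure are not given in this summary, so `Labeled`, `select_error_examples`, `build_prompt` and the instruction text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Labeled:
    sentence: str
    gold: bool        # True = deliberative according to the annotators
    predicted: bool   # label from an earlier run (e.g. zero-shot), assumed available


def select_error_examples(dev: list[Labeled], k: int) -> list[Labeled]:
    """Pick up to k development sentences that the earlier run got wrong."""
    errors = [ex for ex in dev if ex.gold != ex.predicted]
    return errors[:k]


def build_prompt(target: str, shots: list[Labeled]) -> str:
    """Assemble a few-shot prompt that asks for reasoning before the label."""
    lines = [
        # Hypothetical instruction text, not the paper's prompt.
        "Decide whether the text is covered by the deliberative process privilege.",
        "Reason step by step, then give a final answer of YES or NO.",
        "",
    ]
    for ex in shots:
        # Each shot shows the corrected (gold) label for a past mistake.
        lines.append(f"Sentence: {ex.sentence}")
        lines.append(f"Answer: {'YES' if ex.gold else 'NO'}")
        lines.append("")
    lines.append(f"Sentence: {target}")
    lines.append("Reasoning:")
    return "\n".join(lines)
```

The resulting string would be sent to whichever local model is in use; how the model is served is outside the scope of this sketch.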

If this is right

Transparency offices could run the classifier locally to decide redactions without transmitting unreviewed documents to third-party services.
The higher recall reduces the chance that truly deliberative material is released by mistake.
The identified linguistic patterns offer a starting point for lighter-weight or hybrid detection methods.
Error-based example selection provides a repeatable way to improve prompting for domain-specific classification tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same prompting recipe could be tested on other FOIA exemptions that also hinge on intent or context.
Consumer-grade deployment lowers the barrier for smaller agencies or public oversight groups to perform their own classification.
The combination of first-person and opinion verbs might be captured by simpler keyword or syntactic rules in some document types.
Measuring how performance changes across document formats or government agencies would show where retraining or prompt tuning is required.
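
The keyword-rule idea above could be prototyped roughly as follows. This is only a sketch: the two cue lists are hypothetical placeholders rather than the lexicons used in the paper, and matching is on exact lowercase tokens with no lemmatization, so inflected forms such as "recommends" are missed.

```python
import re

# Hypothetical cue lists for illustration only.
FIRST_PERSON = {"i", "we", "my", "our", "me", "us"}
OPINION_VERBS = {"think", "believe", "recommend", "suggest", "feel", "propose"}


def cue_flags(sentence: str) -> dict[str, bool]:
    """Report which cues occur in a sentence and whether they co-occur."""
    tokens = set(re.findall(r"[a-z']+", sentence.lower()))
    first_person = bool(tokens & FIRST_PERSON)
    opinion_verb = bool(tokens & OPINION_VERBS)
    return {
        "first_person": first_person,
        "opinion_verb": opinion_verb,
        "both": first_person and opinion_verb,
    }
```

A rule that fires only on `both` mirrors the observation that the joint occurrence is the stronger signal, though lowercasing also makes "US" match the pronoun "us", which a real implementation would need to handle.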

Load-bearing premise

That the performance measured on the evaluated dataset will carry over to real FOIA documents and that running the model on consumer hardware will not produce meaningful drops in accuracy or speed.

What would settle it

Running the same prompting setup on a fresh collection of actual FOIA-released documents whose redaction decisions are already known and checking whether recall and F2 remain at the reported levels.
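
Such a check could be scored with a few lines of code. The sketch below assumes gold redaction decisions and model predictions are available as parallel lists of booleans (True = deliberative); the function names are illustrative and the reported figures to compare against would have to come from the paper itself.

```python
def confusion(gold: list[bool], pred: list[bool]) -> tuple[int, int, int]:
    """Return (tp, fp, fn) for the positive (deliberative) class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    return tp, fp, fn


def recall_and_f2(gold: list[bool], pred: list[bool]) -> tuple[float, float]:
    """Recall and F2 = 5*tp / (5*tp + 4*fn + fp); both are 0.0 when undefined."""
    tp, fp, fn = confusion(gold, pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 5 * tp + 4 * fn + fp
    f2 = 5 * tp / denom if denom else 0.0
    return recall, f2
```

Running `recall_and_f2` separately per agency or document type would also show where any drop is concentrated.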

Figures

Figures reproduced from arXiv: 2605.10211 by David Graus, Maik Larooij.

**Figure 2.** Comparison of precision of deliberative sentences (AD) in batch K2 for the different models and variants.

Original abstract

Government transparency laws, like the Freedom of Information (FOIA) acts in the United States and United Kingdom, and the Woo (Open Government Act) in the Netherlands, grant citizens the right to directly request documents from the government. As these documents might contain sensitive information, such as personal information or threats to national security, the laws allow governments to redact sensitive parts of the documents prior to release. We build on prior research to perform automatic sensitivity classification for the FOIA Exemption 5 deliberative process privilege using Large Language Models (LLMs). However, processing documents not yet cleared for review via third-party cloud APIs is often legally or politically untenable. Therefore, in this work, we perform sensitivity classification with a small, local model, deployable on consumer-grade hardware (Qwen3.5 9B). We compare eight variants of applying LLMs for sentence classification, using well-known prompting techniques, and find that a combination of Chain-of-Thought prompting and few-shot prompting with error-based examples outperforms classification models of earlier work in terms of recall and F2 score. This method also closely approaches the performance of a widely-used, cost-efficient commercial model (Gemini 2.5 Flash). In an additional analysis, we find that sentences that are predicted as deliberative contain more verbs that indicate the expression of opinions, and are more often phrased in first-person. Above all, deliberativeness seems characterized by the presence of a combination of multiple indicators, in particular the combination of first-person words with a verb for expressing opinion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A 9B local model with CoT plus error-based few-shot prompting beats some prior classifiers on deliberative FOIA sentences and nearly matches Gemini, but the evaluation details are too thin to judge reliability or scope.


The paper's core result is that Qwen3.5-9B, run locally, reaches higher recall and F2 than earlier classification models on this task when it combines chain-of-thought with few-shot examples drawn from prior errors, and it comes close to Gemini 2.5 Flash without sending documents off-site. That local-only constraint is the practical point they get right, since cloud APIs are often ruled out for uncleared FOIA material. The eight-variant comparison is direct, and the follow-up check that predicted deliberative sentences contain more first-person phrasing and opinion verbs is a straightforward way to see what the model is actually using. It keeps the claim modest and ties back to the data rather than over-interpreting the prompting trick itself. The linguistic observation is new enough relative to the cited sensitivity-classification papers to count as a small addition.

The soft spots sit in the evaluation. No dataset sizes, splits, or statistical tests appear in the abstract, and the full protocol is not described, so it is impossible to tell whether the gains are stable or tied to one narrow collection of documents. Real FOIA corpora differ by agency, length, and topic, and nothing in the write-up tests cross-domain or temporal hold-out. That makes the generalization claim the weakest part, exactly as the stress-test note flags.

This is the sort of applied note that could help people building local tools for government transparency workflows. A reader who needs a concrete prompting recipe for legal sentence classification on consumer hardware will get usable ideas. It is not foundational, but the local-model focus and the honest empirical comparison are solid enough that it deserves a serious referee who can ask for the missing dataset details, error analysis, and at least one cross-validation check. I would send it to review rather than desk-reject.

Referee Report

3 major / 1 minor

Summary. The manuscript evaluates the use of a local 9B-parameter LLM (Qwen3.5) for sentence-level classification of deliberative-process-privilege content under FOIA Exemption 5. Eight prompting variants are compared; the authors report that Chain-of-Thought combined with few-shot examples selected on the basis of prior errors yields higher recall and F2 than the classification models of earlier work while approaching the performance of Gemini 2.5 Flash. A secondary linguistic analysis finds that predicted deliberative sentences contain more opinion-expressing verbs and first-person markers, often in combination.

Significance. If the empirical comparisons are reproducible and the test distribution is representative, the work demonstrates that small, locally deployable models can perform competitively on a legally sensitive classification task without cloud APIs. This has direct implications for privacy-preserving FOIA processing pipelines. The linguistic feature analysis supplies a modest interpretability contribution. The absence of dataset statistics, statistical tests, and cross-domain validation in the abstract, however, prevents a firm judgment on whether the reported gains are robust or dataset-specific.

major comments (3)

[Abstract] Abstract: the claim of outperformance in recall and F2 is presented without any dataset size, class balance, number of documents, or statistical significance tests for the comparisons against prior classifiers. This information is load-bearing for the central empirical claim.
[Evaluation] Evaluation protocol (inferred from abstract and results description): no cross-agency, cross-topic, or temporal hold-out validation is described. Given that real FOIA corpora vary substantially by agency, document length, and redaction style, the reported gains may not generalize beyond the (unspecified) test set.
[Results] Results section: the abstract states that the best prompting variant 'closely approaches' Gemini 2.5 Flash, yet supplies neither the exact F2/recall deltas nor confidence intervals, making it impossible to judge whether the local model is practically interchangeable.
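
One common way to supply the confidence intervals requested above is a paired bootstrap over sentences. The sketch below is a generic illustration, not the authors' procedure: it assumes per-sentence gold labels and predictions from two systems as parallel boolean lists, and `bootstrap_f2_delta`, the resample count and the seed are arbitrary choices.

```python
import random


def f2(gold, pred):
    """F2 for the positive class: 5*tp / (5*tp + 4*fn + fp), 0.0 if undefined."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    denom = 5 * tp + 4 * fn + fp
    return 5 * tp / denom if denom else 0.0


def bootstrap_f2_delta(gold, pred_a, pred_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for F2(a) - F2(b), resampling sentences with replacement."""
    rng = random.Random(seed)
    n = len(gold)
    deltas = []
    for _ in range(n_resamples):
        # Use the same resampled indices for both systems (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        deltas.append(f2(g, a) - f2(g, b))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

An interval that excludes zero would suggest the gap between the local model and the commercial one is not just resampling noise. Resampling whole documents instead of sentences may be more appropriate if sentences within a document are correlated.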

minor comments (1)

[Abstract] The abstract refers to 'earlier work' without citing the specific prior classification models or their reported metrics, which would aid readers in assessing the magnitude of improvement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to the manuscript.


Referee: [Abstract] Abstract: the claim of outperformance in recall and F2 is presented without any dataset size, class balance, number of documents, or statistical significance tests for the comparisons against prior classifiers. This information is load-bearing for the central empirical claim.

Authors: We agree that the abstract would be strengthened by including these details. In the revised manuscript we will expand the abstract to report the total number of documents and sentences in the dataset, the class distribution, and a brief statement that the reported gains in recall and F2 over prior classifiers are statistically significant according to the tests already presented in the results section. revision: yes
Referee: [Evaluation] Evaluation protocol (inferred from abstract and results description): no cross-agency, cross-topic, or temporal hold-out validation is described. Given that real FOIA corpora vary substantially by agency, document length, and redaction style, the reported gains may not generalize beyond the (unspecified) test set.

Authors: The evaluation protocol, including the construction and split of the test set, is described in the Methods section of the full manuscript. We did not conduct explicit cross-agency, cross-topic, or temporal hold-out experiments in the present study. We will add a limitations subsection that explicitly discusses this scope and the potential for domain shift across FOIA corpora, while preserving the current experimental design. revision: partial
Referee: [Results] Results section: the abstract states that the best prompting variant 'closely approaches' Gemini 2.5 Flash, yet supplies neither the exact F2/recall deltas nor confidence intervals, making it impossible to judge whether the local model is practically interchangeable.

Authors: We will revise the abstract to include the exact recall and F2 scores for the best local prompting variant and for Gemini 2.5 Flash, together with any confidence intervals or standard deviations obtained from the repeated runs reported in the results. This will allow readers to assess the practical difference in performance directly. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical prompting comparison


The paper reports an empirical evaluation of eight LLM prompting variants (including CoT and error-based few-shot) on sentence-level classification for FOIA deliberative process privilege. Performance is measured via recall and F2 against prior classification models and Gemini 2.5 Flash, with an auxiliary linguistic analysis of predicted sentences. No equations, fitted parameters, self-definitional constructs, or derivation chains exist; results rest on external model outputs and dataset metrics rather than internal definitions or self-citations that reduce the central claim to its inputs. The work is self-contained as a prompting experiment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes LLMs can reliably perform the classification task via prompting alone; no new mathematical entities or fitted constants are introduced.

axioms (1)

Domain assumption: LLMs can perform nuanced sentence-level classification for legal privilege detection using only prompting techniques without fine-tuning.
Central to all eight variants tested and the performance claims.


