Retrieval Augmented Classification for Confidential Documents
Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3
The pith
Retrieval augmented classification keeps confidential-document accuracy stable on unbalanced data by grounding decisions in an external vector store rather than model weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAC classifies confidential documents by retrieving similar examples from an external vector store and prompting a language model with those examples, achieving accuracy comparable to fine-tuning on balanced data and greater stability on unbalanced data. On the WikiLeaks corpus, RAC reaches about 96 percent accuracy on both the original unbalanced set and the augmented balanced set, and up to 94 percent F1 with suitable prompting, while fine-tuning drops from 90 percent F1 on the balanced set to 88 percent F1 on the unbalanced set. Because decisions remain grounded in the store rather than in learned weights, the method avoids embedding sensitive content in model parameters and reduces sensitivity to label skew.
What carries the argument
Retrieval-augmented classification pipeline that retrieves similar passages from an external vector store via similarity matching and feeds them to a prompted language model for the final confidential/non-confidential decision.
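The retrieve-then-prompt flow can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bag-of-words `embed` function stands in for a real embedding model, the four stored documents and the prompt wording are invented, and the final call to a language model is left out.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (assumption: stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(store, query, k=3):
    """Return the k stored examples most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(query, neighbours):
    """Assemble a few-shot prompt from the retrieved labelled examples."""
    lines = ["Classify the document as CONFIDENTIAL or NON-CONFIDENTIAL.",
             "Labelled examples:"]
    for n in neighbours:
        lines.append(f"- [{n['label']}] {n['text']}")
    lines.append(f"Document: {query}")
    lines.append("Label:")
    return "\n".join(lines)

# Hypothetical store contents, invented for illustration.
store = [{"text": t, "label": l, "vec": embed(t)} for t, l in [
    ("embassy cable on secret negotiation with minister", "CONFIDENTIAL"),
    ("press release about public cultural event", "NON-CONFIDENTIAL"),
    ("secret assessment of minister intentions", "CONFIDENTIAL"),
    ("public schedule for embassy holiday closure", "NON-CONFIDENTIAL"),
]]

query = "cable reporting secret talks with the minister"
neighbours = retrieve(store, query, k=2)
prompt = build_prompt(query, neighbours)
print(prompt)
```

The prompt string would then be sent to whichever language model is in use; the labelled neighbours, not the model's weights, carry the task-specific evidence.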
If this is right
- New documents can be added to the classifier immediately by reindexing the vector store without retraining the model.
- Label skew in the training distribution affects RAC less than it affects fine-tuning because retrieval depends on similarity rather than class frequency.
- Sensitive content remains outside model weights, lowering the risk of parameter-level leakage in regulated settings.
- Context-length constraints on the base model become less binding because only the retrieved passages plus a short prompt need to fit.
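The first point, adding documents without retraining, can be illustrated with a toy store. This is a sketch under stated assumptions: `VectorStore` is a hypothetical in-memory class, the two-dimensional vectors are made-up embeddings, and the nearest-neighbour label is used directly in place of a prompted language model.

```python
import math

class VectorStore:
    """Hypothetical in-memory store; a real deployment would likely use an ANN index."""

    def __init__(self):
        self.items = []  # list of (vector, label) pairs

    def add(self, vector, label):
        # Indexing a new example is an append here; no model weights change.
        self.items.append((vector, label))

    def nearest_label(self, query):
        """Label of the stored vector with the highest cosine similarity to the query."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))
        return max(self.items, key=lambda it: cos(query, it[0]))[1]

store = VectorStore()
store.add([1.0, 0.0], "NON-CONFIDENTIAL")
store.add([0.9, 0.1], "NON-CONFIDENTIAL")

query = [0.1, 1.0]
before = store.nearest_label(query)    # only non-confidential examples are indexed

store.add([0.0, 1.0], "CONFIDENTIAL")  # a new labelled document arrives and is indexed
after = store.nearest_label(query)

print(before, "->", after)
```

The decision for the same query changes as soon as the new example is indexed, which is the property the fine-tuning baseline can only obtain through retraining.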
Where Pith is reading between the lines
- The same retrieval-plus-prompt pattern could be tested on other high-stakes unbalanced classification tasks such as medical-record triage or financial-compliance screening.
- If the vector store is itself governed by access-control rules, the overall system may satisfy stricter data-residency requirements than any weight-based alternative.
- Prompt engineering that selects which retrieved passages to include could be further optimized to trade off accuracy against token budget in long-document settings.
Load-bearing premise
The external vector store must return sufficiently relevant passages to support accurate classification even when the prompt cannot contain the full sensitive document.
What would settle it
RAC accuracy or F1 falling materially below fine-tuning performance on a fresh unbalanced confidential corpus drawn from a different domain would falsify the stability advantage.
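Accuracy and F1 can diverge sharply on unbalanced data, which is why such a test should track both. The sketch below uses an invented 90/10 label split and assumes macro-averaged F1 (the appendix table reports Macro F1; the averaging used for the headline figures is not stated here).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for(label, y_true, y_pred):
    """F1 score for a single class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so the minority class counts equally."""
    labels = sorted(set(y_true))
    return sum(f1_for(l, y_true, y_pred) for l in labels) / len(labels)

# Hypothetical split: 90 non-confidential (0) and 10 confidential (1) documents.
y_true = [0] * 90 + [1] * 10
# A classifier that gets every majority document right but misses half the minority class.
y_pred = [0] * 90 + [1] * 5 + [0] * 5

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

In this made-up case accuracy is 0.95 while macro F1 is roughly 0.82, so a stable accuracy figure alone would not rule out degraded minority-class performance.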
Original abstract
Unauthorized disclosure of confidential documents demands robust, low-leakage classification. In real work environments, there is a lot of inflow and outflow of documents. To continuously update knowledge, we propose a methodology for classifying confidential documents using Retrieval Augmented Classification (RAC). To confirm this effectiveness, we compare RAC and supervised fine tuning (FT) on the WikiLeaks US Diplomacy corpus under realistic sequence-length constraints. On balanced data, RAC matches FT. On unbalanced data, RAC is more stable while delivering comparable performance--about 96% Accuracy on both the original (unbalanced) and augmented (balanced) sets, and up to 94% F1 with proper prompting--whereas FT attains 90% F1 trained on the augmented, balanced set but drops to 88% F1 trained on the original, unbalanced set. When robust augmentation is infeasible, RAC provides a practical, security-preserving path to strong classification by keeping sensitive content out of model weights and under your control, and it remains robust as real-world conditions change in class balance, data, context length, or governance requirements. Because RAC grounds decisions in an external vector store with similarity matching, it is less sensitive to label skew, reduces parameter-level leakage, and can incorporate new data immediately via reindexing--a difficult step for FT, which typically requires retraining. The contributions of this paper are threefold: first, a RAC-based classification pipeline and evaluation recipe; second, a controlled study that isolates class imbalance and context-length effects for FT versus RAC in confidential-document grading; and third, actionable guidance on RAC design patterns for governed deployments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Retrieval Augmented Classification (RAC) for confidential document classification, comparing it to supervised fine-tuning (FT) on the WikiLeaks US Diplomacy corpus under realistic sequence-length constraints. It claims RAC matches FT performance on balanced data and is more stable on unbalanced data, achieving ~96% accuracy on both the original (unbalanced) and augmented (balanced) sets with up to 94% F1 via proper prompting, while FT reaches 90% F1 on balanced data but drops to 88% F1 on unbalanced data. RAC is positioned as less sensitive to label skew due to grounding in an external vector store, reducing parameter leakage, and allowing immediate updates via reindexing without retraining.
Significance. If the empirical results hold and the underlying mechanism is isolated, the work provides a practical, security-preserving alternative for document classification in imbalanced, dynamic real-world environments where retraining is costly and data leakage must be minimized.
major comments (1)
- [Abstract] The central claim that RAC 'is less sensitive to label skew' because it 'grounds decisions in an external vector store with similarity matching' is load-bearing for the reported stability advantage (96% accuracy / 94% F1 on unbalanced data vs. FT's drop to 88% F1). However, no per-class retrieval metrics, ablation on k, or embedding-model analysis is referenced to confirm that minority-class documents are reliably surfaced for minority queries. Standard cosine similarity on unbalanced corpora is known to exhibit majority-class bias, so the performance gap could stem from prompting or other factors rather than the claimed retrieval mechanism.
minor comments (1)
- [Abstract] No details are provided on experimental setup, error bars, statistical significance, exact RAC/prompting implementations, or sequence-length constraints, which limits verification of the stated performance figures.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comment below and have incorporated revisions to strengthen the supporting analysis for our claims.
Referee: [Abstract] The central claim that RAC 'is less sensitive to label skew' because it 'grounds decisions in an external vector store with similarity matching' is load-bearing for the reported stability advantage (96% accuracy / 94% F1 on unbalanced data vs. FT's drop to 88% F1). However, no per-class retrieval metrics, ablation on k, or embedding-model analysis is referenced to confirm that minority-class documents are reliably surfaced for minority queries. Standard cosine similarity on unbalanced corpora is known to exhibit majority-class bias, so the performance gap could stem from prompting or other factors rather than the claimed retrieval mechanism.
Authors: We agree that the claim of reduced sensitivity to label skew is central and that the manuscript would benefit from explicit supporting analysis of the retrieval mechanism. The current version does not include per-class retrieval metrics, a k-ablation, or embedding-model analysis. In the revised manuscript we have added these: (1) per-class retrieval recall and precision on the unbalanced WikiLeaks set, showing minority-class documents are surfaced at rates sufficient to support the observed F1 stability; (2) an ablation over k=1..10 demonstrating that the 96% accuracy / 94% F1 advantage persists across reasonable k values and is not an artifact of a single prompting configuration; and (3) a brief analysis of the embedding model together with a discussion of how the combination of similarity retrieval and task-specific prompting reduces majority-class bias relative to pure fine-tuning. These additions isolate the contribution of the external vector store and address the possibility that prompting alone drives the gap.
Revision: yes
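A per-class retrieval check of the kind discussed in this exchange could look like the following. It is a sketch only: the vectors and labels are invented, the leave-one-out protocol and the same-label precision@k metric are assumptions rather than the authors' stated procedure, and the k values are chosen for brevity.

```python
import math
from collections import defaultdict

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def per_class_precision_at_k(items, k):
    """Leave-one-out: for each item, the share of its top-k neighbours with the same label,
    averaged separately for each class."""
    scores = defaultdict(list)
    for i, (vec, label) in enumerate(items):
        others = [it for j, it in enumerate(items) if j != i]
        top = sorted(others, key=lambda it: cos(vec, it[0]), reverse=True)[:k]
        scores[label].append(sum(l == label for _, l in top) / k)
    return {label: sum(v) / len(v) for label, v in scores.items()}

# Hypothetical unbalanced store: six majority-class vectors, two minority-class vectors.
items = [
    ([1.0, 0.0], "NON"), ([0.9, 0.1], "NON"), ([0.8, 0.2], "NON"),
    ([0.95, 0.05], "NON"), ([0.7, 0.3], "NON"), ([0.85, 0.15], "NON"),
    ([0.1, 1.0], "CONF"), ([0.0, 1.0], "CONF"),
]

# Small k-ablation: how does each class fare as more neighbours are retrieved?
ablation = {k: per_class_precision_at_k(items, k) for k in (1, 3, 5)}
for k, result in ablation.items():
    print(k, result)
```

In this toy data the minority class is retrieved perfectly at k=1, but its same-label share falls once k exceeds the number of minority examples available, because the remaining slots are filled by majority-class neighbours. Reporting the metric per class and per k makes that effect visible.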
Circularity Check
Empirical comparison with no derivations or self-referential reductions
The paper is an empirical study that reports accuracy and F1 scores for RAC versus fine-tuning on balanced and unbalanced splits of the WikiLeaks corpus. No equations, parameter fits, or derivations are present. Claims about robustness to label skew are justified by direct experimental measurements rather than by reducing to self-defined inputs or self-citations. No load-bearing steps match any of the enumerated circularity patterns.
Appendix: Result of Statistical Tests

| Model | Macro F1 | 95% CI | p (vs FT-Orig) | p (vs FT-Aug) | p (vs RAC-k0) |
|---|---|---|---|---|---|
| FT (Original) | 0.8842 | [0.8573, 0.8993] | N/A | 0.0004 | N/A |
| FT (Augmented) | 0.9099 | [0.8871, 0.9230] | 0.0004 | N/A | N/A |
| RAC-Orig (k=0) | 0.9255 | [0.9134, 0.9369] | 0.0039 | 9.83E-08 | N/A |
| RAC-Aug (k=0) | 0.9321 | [0.9183, 0.9452] | 0.3397 | 0.1588 | 7.35E-07 |
| RAC-Orig (k=3) | 0.9327 | [0.9195, 0.94... | | | |
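A 95% confidence interval for Macro F1 like the one reported in these statistical-test results can be estimated by resampling. The sketch below uses a percentile bootstrap, which is an assumption: the source does not say how its intervals were computed. The labels and predictions are invented, and `bootstrap_ci` is a hypothetical helper name.

```python
import random

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    total = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        total += 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return total / len(labels)

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample test items with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical test set: 90 majority items (2 misclassified), 10 minority items (2 misclassified).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [1] * 8 + [0] * 2

point = macro_f1(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred)
print(f"macro_f1={point:.4f} 95% CI=[{low:.4f}, {high:.4f}]")
```

With only ten minority examples the interval in this toy case is wide, which illustrates why interval width matters when comparing configurations whose point estimates differ by a few points of F1.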