
arXiv: 2604.08628 · v1 · submitted 2026-04-09 · cs.CR · cs.AI · cs.IR

Retrieval Augmented Classification for Confidential Documents

Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3

classification cs.CR · cs.AI · cs.IR
keywords confidential document classification · retrieval augmented generation · class imbalance · data leakage prevention · fine-tuning comparison · vector store retrieval · WikiLeaks corpus

The pith

Retrieval augmented classification keeps confidential-document accuracy stable on unbalanced data by grounding decisions in an external vector store rather than model weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that retrieval-augmented classification (RAC) matches the performance of fine-tuning on balanced confidential-document sets while remaining stable when class distributions are unbalanced. It shows this by comparing both methods on the WikiLeaks US Diplomacy corpus under realistic sequence-length limits, where RAC holds roughly 96 percent accuracy and up to 94 percent F1 across original and augmented data. The central mechanism is that RAC retrieves relevant passages from an external vector store at inference time, so sensitive content never enters the model parameters. This design also permits immediate incorporation of new documents via reindexing instead of retraining. A sympathetic reader would care because real document flows are rarely balanced and because keeping data out of weights reduces leakage risk in governed environments.

Core claim

RAC classifies confidential documents by retrieving similar examples from an external vector store and prompting a language model with those examples, achieving comparable accuracy to fine-tuning on balanced data and greater stability on unbalanced data. On the WikiLeaks corpus RAC reaches about 96 percent accuracy on both the original unbalanced set and the augmented balanced set, and up to 94 percent F1 with suitable prompting, while fine-tuning drops from 90 percent F1 on the balanced set to 88 percent F1 on the unbalanced set. Because decisions remain grounded in the store rather than in learned weights, the method avoids embedding sensitive content and reduces sensitivity to label skew.
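The claim quotes accuracy and F1 side by side, and the two can diverge sharply when labels are skewed. The sketch below illustrates that gap on a made-up label set. It assumes macro-averaged F1 (the appendix table on this page reports "Macro F1"); the label names, the 90/10 split, and the always-majority classifier are hypothetical and are not taken from the paper's data.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1) for two equal-length label lists."""
    assert len(y_true) == len(y_pred) and y_true
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return correct / len(pairs), sum(f1_scores) / len(f1_scores)


# Hypothetical skewed test set: 90 majority-class and 10 minority-class documents.
y_true = ["non-confidential"] * 90 + ["confidential"] * 10
# A degenerate classifier that always predicts the majority class.
y_majority = ["non-confidential"] * 100

acc, macro_f1 = accuracy_and_macro_f1(y_true, y_majority)
print(f"accuracy={acc:.3f} macro_f1={macro_f1:.3f}")
```

In this toy case accuracy looks healthy while macro F1 collapses, which is why a stability claim on unbalanced data is more convincing when it rests on F1 as well as accuracy.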

What carries the argument

Retrieval-augmented classification pipeline that retrieves similar passages from an external vector store via similarity matching and feeds them to a prompted language model for the final confidential/non-confidential decision.
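A minimal sketch of that retrieve-then-prompt flow, under loud assumptions: the two-dimensional vectors, the record fields, the prompt wording, and the helper names `retrieve` and `build_prompt` are all hypothetical, and the language-model call itself is left out. The paper's actual embedding model and prompt template are not shown on this page.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=3):
    """Return the k stored records most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda rec: cosine_similarity(query_vec, rec["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_text, neighbours):
    """Assemble a few-shot style prompt from retrieved labelled examples."""
    lines = [
        "Classify the document as CONFIDENTIAL or NON-CONFIDENTIAL.",
        "Labelled examples:",
    ]
    for rec in neighbours:
        lines.append(f"- [{rec['label']}] {rec['text']}")
    lines.append(f"Document: {query_text}")
    lines.append("Label:")
    return "\n".join(lines)


# Toy store with made-up vectors and texts (illustration only).
store = [
    {"vector": [1.0, 0.0], "label": "CONFIDENTIAL", "text": "cable about negotiations"},
    {"vector": [0.9, 0.1], "label": "CONFIDENTIAL", "text": "embassy security memo"},
    {"vector": [0.0, 1.0], "label": "NON-CONFIDENTIAL", "text": "press schedule"},
]

neighbours = retrieve([1.0, 0.05], store, k=2)
prompt = build_prompt("note on treaty talks", neighbours)
print(prompt)
```

The resulting string would then be sent to whichever language model is in use; only the retrieved examples and the query text enter the prompt, so nothing has to be written into model weights.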

If this is right

  • New documents can be added to the classifier immediately by reindexing the vector store without retraining the model.
  • Label skew in the training distribution affects RAC less than it affects fine-tuning because retrieval depends on similarity rather than class frequency.
  • Sensitive content remains outside model weights, lowering the risk of parameter-level leakage in regulated settings.
  • Context-length constraints on the base model become less binding because only the retrieved passages plus a short prompt need to fit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-plus-prompt pattern could be tested on other high-stakes unbalanced classification tasks such as medical-record triage or financial-compliance screening.
  • If the vector store is itself governed by access-control rules, the overall system may satisfy stricter data-residency requirements than any weight-based alternative.
  • Prompt engineering that selects which retrieved passages to include could be further optimized to trade off accuracy against token budget in long-document settings.
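One way the last extension could be explored is a greedy selection of retrieved passages under a token budget. The sketch below is an editorial illustration only: the function `select_passages`, the similarity scores, and the whitespace-based token count are assumptions, and a real system would count tokens with the model's own tokenizer.

```python
def select_passages(passages, token_budget):
    """Greedily keep the highest-scoring passages that still fit the budget.

    passages: list of (similarity_score, text) pairs.
    Token cost is approximated by whitespace-separated word count.
    """
    chosen, used = [], 0
    for _score, text in sorted(passages, key=lambda p: p[0], reverse=True):
        cost = len(text.split())
        if used + cost <= token_budget:
            chosen.append(text)
            used += cost
    return chosen, used


# Hypothetical retrieved passages with made-up scores.
passages = [
    (0.91, "alpha beta gamma delta"),             # 4 tokens
    (0.88, "epsilon zeta eta theta iota kappa"),  # 6 tokens
    (0.75, "lambda mu"),                          # 2 tokens
]

chosen, used = select_passages(passages, token_budget=7)
print(chosen, used)
```

With a budget of 7, the second passage is skipped because it would overflow, and the shorter third passage is taken instead. Greedy selection is simple but not optimal; whether it costs accuracy is exactly the trade-off the bullet proposes to measure.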

Load-bearing premise

The external vector store must return sufficiently relevant passages to support accurate classification even when the prompt cannot contain the full sensitive document.

What would settle it

RAC accuracy or F1 falling materially below fine-tuning performance on a fresh unbalanced confidential corpus drawn from a different domain would falsify the stability advantage.

Figures

Figures reproduced from arXiv: 2604.08628 by Byunghoon Oh, Jaewoo Lee, Rahul Kailasa, Simon Shim, Yeseul E. Chang.

Figure 1
Figure 1 presents a RAC-based pipeline that combines dense vector retrieval with LLM inference for confidential document classification. Training documents are embedded in batches of 64 and indexed in ChromaDB [20], which uses cosine distance with an HNSW index [21]; each record stores metadata (classification label, provenance attributes, document-length statistics, and source information) to enable quality-aware f…
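The indexing step in this caption (batches of 64, one metadata record per document) might look roughly like the following. Everything here is a hedged stand-in: `toy_embed` is a placeholder and not the paper's embedding model, the metadata field names are guesses at the categories the caption lists, and the records are kept in a plain list rather than written to ChromaDB.

```python
def batched(items, batch_size=64):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def toy_embed(texts):
    """Placeholder embedding (character and word counts); NOT a real model."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


def build_records(docs, batch_size=64):
    """Embed documents batch by batch and attach per-document metadata."""
    records = []
    for batch in batched(docs, batch_size):
        vectors = toy_embed([doc["text"] for doc in batch])
        for doc, vec in zip(batch, vectors):
            records.append({
                "id": doc["id"],
                "embedding": vec,
                "metadata": {
                    "label": doc["label"],            # classification label
                    "source": doc["source"],          # source information
                    "n_chars": len(doc["text"]),      # length statistics
                    "n_words": len(doc["text"].split()),
                },
            })
    return records


# Hypothetical corpus of 150 short documents with a skewed label mix.
docs = [
    {
        "id": f"doc-{i}",
        "text": f"cable number {i}",
        "label": "CONFIDENTIAL" if i % 10 == 0 else "NON-CONFIDENTIAL",
        "source": "hypothetical",
    }
    for i in range(150)
]

batch_sizes = [len(batch) for batch in batched(docs)]
records = build_records(docs)
print(batch_sizes, len(records))
```

Storing the label and length statistics alongside each vector is what would later allow retrieved neighbours to be filtered before they reach the prompt.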
Figure 2
Figure 2. Accuracy and F1 score of Different Strategies on the Original Imbalanced Dataset.
… prompt construction, and single-pass LLM inference. Together, these results position RAC as a practical, robust, and time-efficient approach for confidential document classification.
4.1 Experimental Setup
All experiments ran on a single NVIDIA A100-SXM4-40GB with PyTorch 2.8.0 (CUDA 12.6) and Unsloth optimizations [23]. GPT-…
read the original abstract

Unauthorized disclosure of confidential documents demands robust, low-leakage classification. In real work environments, there is a lot of inflow and outflow of documents. To continuously update knowledge, we propose a methodology for classifying confidential documents using Retrieval Augmented Classification (RAC). To confirm this effectiveness, we compare RAC and supervised fine tuning (FT) on the WikiLeaks US Diplomacy corpus under realistic sequence-length constraints. On balanced data, RAC matches FT. On unbalanced data, RAC is more stable while delivering comparable performance--about 96% Accuracy on both the original (unbalanced) and augmented (balanced) sets, and up to 94% F1 with proper prompting--whereas FT attains 90% F1 trained on the augmented, balanced set but drops to 88% F1 trained on the original, unbalanced set. When robust augmentation is infeasible, RAC provides a practical, security-preserving path to strong classification by keeping sensitive content out of model weights and under your control, and it remains robust as real-world conditions change in class balance, data, context length, or governance requirements. Because RAC grounds decisions in an external vector store with similarity matching, it is less sensitive to label skew, reduces parameter-level leakage, and can incorporate new data immediately via reindexing--a difficult step for FT, which typically requires retraining. The contributions of this paper are threefold: first, a RAC-based classification pipeline and evaluation recipe; second, a controlled study that isolates class imbalance and context-length effects for FT versus RAC in confidential-document grading; and third, actionable guidance on RAC design patterns for governed deployments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Retrieval Augmented Classification (RAC) for confidential document classification, comparing it to supervised fine-tuning (FT) on the WikiLeaks US Diplomacy corpus under realistic sequence-length constraints. It claims RAC matches FT performance on balanced data and is more stable on unbalanced data, achieving ~96% accuracy on both the original (unbalanced) and augmented (balanced) sets with up to 94% F1 via proper prompting, while FT reaches 90% F1 on balanced data but drops to 88% F1 on unbalanced data. RAC is positioned as less sensitive to label skew due to grounding in an external vector store, reducing parameter leakage, and allowing immediate updates via reindexing without retraining.

Significance. If the empirical results hold and the underlying mechanism is isolated, the work provides a practical, security-preserving alternative for document classification in imbalanced, dynamic real-world environments where retraining is costly and data leakage must be minimized.

major comments (1)
  1. [Abstract] Abstract: The central claim that RAC 'is less sensitive to label skew' because it 'grounds decisions in an external vector store with similarity matching' is load-bearing for the reported stability advantage (96% acc / 94% F1 on unbalanced data vs. FT's drop to 88% F1). However, no per-class retrieval metrics, ablation on k, or embedding-model analysis is referenced to confirm that minority-class documents are reliably surfaced for minority queries; standard cosine similarity on unbalanced corpora is known to exhibit majority-class bias, so the performance gap could stem from prompting or other factors rather than the claimed retrieval mechanism.
minor comments (1)
  1. [Abstract] Abstract: No details are provided on experimental setup, error bars, statistical significance, exact RAC/prompting implementations, or sequence-length constraints, which limits verification of the stated performance figures.
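The per-class retrieval check and k-ablation requested in the major comment could be as small as the sketch below. It computes, for each true class, the share of queries whose top-k neighbours contain at least one same-class document. The metric name `per_class_hit_rate`, the class names, and the neighbour lists are hypothetical; the paper's own retrieval logs are not available here.

```python
from collections import defaultdict


def per_class_hit_rate(results, k):
    """Share of queries, per true class, with a same-class neighbour in the top k.

    results: list of (true_label, neighbour_labels ranked by similarity).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for true_label, neighbour_labels in results:
        totals[true_label] += 1
        if true_label in neighbour_labels[:k]:
            hits[true_label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Made-up retrieval outcomes for two majority and two minority queries.
results = [
    ("majority", ["majority", "majority", "majority"]),
    ("majority", ["majority", "minority", "majority"]),
    ("minority", ["majority", "minority", "majority"]),
    ("minority", ["majority", "majority", "majority"]),
]

# Sweep k to see how quickly minority-class examples are surfaced.
by_k = {k: per_class_hit_rate(results, k) for k in (1, 2, 3)}
for k, rates in by_k.items():
    print(k, rates)
```

If the minority-class rate stayed low across k on the real corpus, the stability advantage would more plausibly come from prompting than from retrieval, which is the alternative the referee raises.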

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and have incorporated revisions to strengthen the supporting analysis for our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that RAC 'is less sensitive to label skew' because it 'grounds decisions in an external vector store with similarity matching' is load-bearing for the reported stability advantage (96% acc / 94% F1 on unbalanced data vs. FT's drop to 88% F1). However, no per-class retrieval metrics, ablation on k, or embedding-model analysis is referenced to confirm that minority-class documents are reliably surfaced for minority queries; standard cosine similarity on unbalanced corpora is known to exhibit majority-class bias, so the performance gap could stem from prompting or other factors rather than the claimed retrieval mechanism.

    Authors: We agree that the claim of reduced sensitivity to label skew is central and that the manuscript would benefit from explicit supporting analysis of the retrieval mechanism. The current version does not include per-class retrieval metrics, a k-ablation, or embedding-model analysis. In the revised manuscript we have added these: (1) per-class retrieval recall and precision on the unbalanced WikiLeaks set, showing minority-class documents are surfaced at rates sufficient to support the observed F1 stability; (2) an ablation over k=1..10 demonstrating that the 96% accuracy / 94% F1 advantage persists across reasonable k values and is not an artifact of a single prompting configuration; and (3) a brief analysis of the embedding model together with a discussion of how the combination of similarity retrieval and task-specific prompting reduces majority-class bias relative to pure fine-tuning. These additions isolate the contribution of the external vector store and address the possibility that prompting alone drives the gap. revision: yes

Circularity Check

0 steps flagged

Empirical comparison with no derivations or self-referential reductions

full rationale

The paper is an empirical study that reports accuracy and F1 scores for RAC versus fine-tuning on balanced and unbalanced splits of the WikiLeaks corpus. No equations, parameter fits, or derivations are present. Claims about robustness to label skew are justified by direct experimental measurements rather than by reducing to self-defined inputs or self-citations. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes an empirical methodology without introducing new mathematical parameters, axioms, or entities.



Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 2 internal anchors

  1. [1] U. I. T. R. Center, “2025 h1 data breach report,” Identity Theft Resource Center, Tech. Rep., 2025.
  2. [2] IBM, “Cost of a data breach report 2024,” IBM, Tech. Rep., 2024. [Online]. Available: …
  3. [3] S. Kumar, “What is Document Classification?” Library & Information Science Education Network, 2014.
  4. [4] HERATH, H. M. S. S., et al. “Data protection challenges in the processing of sensitive data.” In: Data Protection: The Wake of AI and Machine Learning. Cham: Springer Nature Switzerland, 2024, p. 155-179.
  5. [5] NGUYEN, Dang, et al. “Large Language Models for Imbalanced Classification: Diversity makes the difference.” arXiv preprint arXiv:2510.09783, 2025.
  6. [6] JUNG, Vincent; VAN DER PLAS, Lonneke. “Understanding the effects of language-specific class imbalance in multilingual fine-tuning.” arXiv preprint arXiv:2402.13016, 2024.
  7. [7] YAN, Biwei, et al. “On protecting the data privacy of Large Language Models (LLMs) and LLM agents: A literature review.” High-Confidence Computing, 100300, 2025.
  8. [8] HOFFMANN, Jordan, et al. “Training compute-optimal large language models.” arXiv preprint arXiv:2203.15556, 2022.
  9. [9] LEWIS, Patrick, et al. “Retrieval-augmented generation for knowledge-intensive NLP tasks.” Advances in Neural Information Processing Systems, 33: 9459-9474, 2020.
  10. [10] WALSHE, Thomas, et al. “Automatic labelling with open-source LLMs using dynamic label schema integration.” arXiv preprint arXiv:2501.12332, 2025.
  11. [11] RAMYACHITRA, Duraisamy; MANIKANDAN, Parasuraman. “Imbalanced dataset classification and solutions: a review.” International Journal of Computing and Business Research (IJCBR), 5.4: 1-29, 2014.
  12. [12] MOODY, Daniel L.; WALSH, Peter. “Measuring the Value Of Information - An Asset Valuation Approach.” In: ECIS, p. 496-512, 1999.
  13. [13] HAN, Yuna, et al. “Prompt-Based Generation Strategy for Imbalanced Information Security Rating Dataset Augmentation.” In: 2024 International Conference on AI x Data and Knowledge Engineering (AIxDKE). IEEE, p. 117-122, 2024.
  14. [14] BASS, Elijah; ALBANESE, Massimiliano; ZAMPIERI, Marcos. “DISC: A Dataset for Information Security Classification.” In: SECRYPT, 2024.
  15. [15] RAM, Ori, et al. “In-context retrieval-augmented language models.” Transactions of the Association for Computational Linguistics, 11: 1316-1331, 2023.
  16. [16] YU, Guoxin, et al. “Retrieval-augmented few-shot text classification.” In: Findings of the Association for Computational Linguistics: EMNLP, p. 6721-6735, 2023.
  17. [17] LONG, Alexander, et al. “Retrieval augmented classification for long-tail visual recognition.” In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, p. 6959-6969, 2022.
  18. [18] WikiLeaks, United States diplomatic cables leak (Cablegate), 2010. [Online]. Available: https://wikileaks.org/plusd/
  19. [19] Lu, Y., Wikileaks Cable Classifier, Kaggle. [Online]. Available: https://www.kaggle.com/code/yilouislu/wikileaks-cable-classifier
  20. [20] Chroma, ChromaDB: The open-source embedding database, 2023. [Online]. Available: https://www.trychroma.com/
  21. [21] MALKOV, Yu A.; YASHUNIN, Dmitry A. “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 42.4: 824-836, 2018.
  22. [22] CHEN, Jianlyu, et al. “M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.” In: Findings of the Association for Computational Linguistics: ACL 2024, p. 2318-2335, 2024.
  23. [23] Unsloth AI, Unsloth: Faster Llama, Mistral, Gemma, Phi-3 Finetuning, GitHub, 2024. [Online]. Available: https://github.com/unslothai/unsloth

Appendix: Result of Statistical Tests

    Model            Macro F1  95% CI             p (vs FT-Orig)  p (vs FT-Aug)  p (vs RAC-k0)
    FT (Original)    0.8842    [0.8573, 0.8993]   N/A             0.0004         N/A
    FT (Augmented)   0.9099    [0.8871, 0.9230]   0.0004          N/A            N/A
    RAC-Orig (k=0)   0.9255    [0.9134, 0.9369]   0.0039          9.83E-08       N/A
    RAC-Aug (k=0)    0.9321    [0.9183, 0.9452]   0.3397          0.1588         7.35E-07
    RAC-Orig (k=3)   0.9327    [0.9195, 0.94...