
arxiv: 2605.16535 · v1 · pith:HHWZIQSQnew · submitted 2026-05-15 · 💻 cs.IR · cs.AI

RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification

Pith reviewed 2026-05-19 21:28 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords multi-label classification · post-hoc thresholding · retrieval augmentation · document classification · label set selection · industrial pipelines · metric learning

The pith

RAPT adapts label selection thresholds by retrieving similar past documents and aggregating their outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Industrial multi-label document classification relies on scoring candidate labels and then choosing a threshold to form each document's label set. Fixed global or label-wise thresholds often fail under OCR noise, label imbalance, varying label counts per document, and changing formats. RAPT solves this by treating past documents as cases: for a new query it retrieves similar documents in the classifier's representation space and adapts the threshold using the outcomes observed in those cases. A sympathetic reader cares because the choice of labels directly determines the accuracy of all downstream extraction and the amount of manual verification required. The method works as a model-agnostic post-hoc wrapper on any predictor that supplies both representations and per-label scores.
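The two static baselines mentioned above can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the label names, and the default cutoff of 0.5 are all assumptions made for the example.

```python
from typing import Dict, Set


def global_threshold(scores: Dict[str, float], tau: float) -> Set[str]:
    """Keep every label whose score reaches one cutoff shared by all labels."""
    return {label for label, score in scores.items() if score >= tau}


def labelwise_threshold(
    scores: Dict[str, float],
    taus: Dict[str, float],
    default: float = 0.5,  # assumed fallback for labels without their own cutoff
) -> Set[str]:
    """Keep every label whose score reaches that label's own fixed cutoff."""
    return {
        label
        for label, score in scores.items()
        if score >= taus.get(label, default)
    }


# Hypothetical per-label scores for one document.
scores = {"invoice": 0.91, "signature": 0.48, "table": 0.35, "stamp": 0.10}

global_set = global_threshold(scores, tau=0.5)
labelwise_set = labelwise_threshold(scores, {"signature": 0.4, "table": 0.6})
```

Both rules are fixed before the query document is seen, which is the property the summary identifies as brittle: neither cutoff can react to how noisy or how label-dense a particular document is.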

Core claim

RAPT is a deployment-oriented retrieval-augmented score thresholding wrapper. For each query document, given a classifier's score vector, it retrieves similar document thresholding situations and adapts the query's label set selection threshold by locally aggregating neighbour solutions such as average label count or cutoff calibration. This post-hoc adaptation improves label set selection without retraining the underlying classifier.
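One way to picture the retrieve-then-aggregate step is the sketch below. It assumes cosine similarity for retrieval and uses the "average label count" aggregation named in the claim; the case format, the helper names, the choice of k, and the floor of one label are illustrative assumptions, and the paper's actual adaptation and calibration rules may differ.

```python
import math
from typing import Dict, List, Sequence, Tuple

Case = Tuple[Sequence[float], int]  # (embedding, number of true labels); assumed format


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_emb: Sequence[float], casebase: List[Case], k: int) -> List[Case]:
    """Return the k cases most similar to the query embedding."""
    ranked = sorted(casebase, key=lambda case: cosine(query_emb, case[0]), reverse=True)
    return ranked[:k]


def adapt_label_set(
    scores: Dict[str, float],
    query_emb: Sequence[float],
    casebase: List[Case],
    k: int = 3,
) -> List[str]:
    """Select the top-n labels, where n is the neighbours' mean label count."""
    neighbours = retrieve(query_emb, casebase, k)
    mean_count = sum(count for _, count in neighbours) / len(neighbours)
    n_labels = max(1, round(mean_count))  # assumption: always emit at least one label
    ranked_labels = sorted(scores, key=scores.get, reverse=True)
    return ranked_labels[:n_labels]


# Hypothetical casebase: three multi-label cases near the query, two far away.
casebase = [([1.0, 0.0], 2), ([0.9, 0.1], 2), ([0.8, 0.2], 3),
            ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
scores = {"invoice": 0.91, "signature": 0.48, "table": 0.35, "stamp": 0.10}
selected = adapt_label_set(scores, query_emb=[1.0, 0.05], casebase=casebase)
```

In this toy case a global cutoff of 0.5 would keep only the top-scoring label, while the neighbours' mean label count of about 2.3 leads the sketch to keep two. The classifier itself is untouched, which is what makes the wrapper post-hoc.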

What carries the argument

Retrieval of similar document thresholding situations (cases) from the classifier's representation space, followed by local aggregation of their outcomes to adapt the current threshold.

If this is right

  • RAPT consistently outperforms global and label-wise static thresholding baselines on both public benchmarks and industrial data.
  • Best results occur when RAPT is paired with metric learners, reaching 0.87 Macro-F1 in the industrial setting.
  • In the same industrial setting, fine-tuned transformer variants with RAPT average 0.775 Macro-F1, about twice the score of few-shot LLM baselines (K = 5), while requiring at least 115x less inference time and 13.5x less GPU memory.
  • The wrapper can be applied to any model that outputs both document representations for similarity search and per-label confidence scores.
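The headline numbers above are Macro-F1 scores. As a reminder of how Macro-F1 differs from Micro-F1, here is a small self-contained sketch using the standard definitions; the toy label sets are invented, and treating F1 as 0 when a label has no true or predicted occurrences is a convention chosen for this example.

```python
from typing import List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(
    gold: List[Set[str]], pred: List[Set[str]], labels: List[str]
) -> Tuple[float, float]:
    """Macro-F1 averages per-label F1; Micro-F1 pools counts over all labels."""
    per_label = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        per_label.append(f1(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_label) / len(per_label)
    micro = f1(total_tp, total_fp, total_fn)
    return macro, micro


# Toy data: the frequent label is always right, the rare label is always wrong.
gold = [{"common"}, {"common"}, {"common", "rare"}, {"common"}]
pred = [{"common"}, {"common"}, {"common"}, {"common", "rare"}]
macro, micro = macro_micro_f1(gold, pred, ["common", "rare"])
```

Because every label counts equally in the macro average, errors on rare labels pull Macro-F1 down much harder than Micro-F1, which is why it is a demanding metric under the label imbalance the paper describes.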

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Representation spaces trained for classification appear to encode enough structure that nearest-neighbor documents also share useful thresholding behavior.
  • Larger case bases of past documents could make the adaptation more robust as document formats continue to evolve.
  • The same retrieval-plus-aggregation pattern may apply to other post-hoc calibration problems where a single global rule is insufficient.

Load-bearing premise

Documents retrieved by similarity on the classifier's representation space will have thresholding situations whose outcomes are relevant and transferable to the query document's optimal label set.

What would settle it

Performance of RAPT falls to or below the static baseline when retrieval is replaced by random selection of past documents or when the adaptation step is disabled.
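The random-selection control described above could be set up along these lines. This is only a sketch of the comparison harness under assumed names and an assumed case format; it does not reproduce any experiment from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple

Case = Tuple[Sequence[float], int]  # (embedding, number of true labels); assumed format


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_cases(query_emb: Sequence[float], casebase: List[Case], k: int) -> List[Case]:
    """Similarity-based retrieval: the condition under test."""
    ranked = sorted(casebase, key=lambda case: cosine(query_emb, case[0]), reverse=True)
    return ranked[:k]


def random_cases(casebase: List[Case], k: int, seed: int = 0) -> List[Case]:
    """Control condition: ignore the query and sample k cases uniformly."""
    return random.Random(seed).sample(casebase, k)


def mean_label_count(cases: List[Case]) -> float:
    return sum(count for _, count in cases) / len(cases)


casebase = [([1.0, 0.0], 2), ([0.9, 0.1], 2), ([0.8, 0.2], 3),
            ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
query = [1.0, 0.05]

retrieved_estimate = mean_label_count(nearest_cases(query, casebase, 3))
random_estimate = mean_label_count(random_cases(casebase, 3))
```

If label sets selected from `random_estimate` scored as well as those selected from `retrieved_estimate` across a test set, the gains could not be attributed to similarity-based retrieval.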

Figures

Figures reproduced from arXiv: 2605.16535 by Darren Nicol, Ikechukwu Nkisi-Orji, Lasal Jayawardena, Nirmalie Wiratunga.

Figure 1: Industry use case pipeline for document processing and information extraction. Left: a typical document with areas indicated for class label assignment. Top: multi-label predictions act as routing decisions for downstream extraction tasks. Bottom: typical prediction scores from a backbone model on the y-axis, shown relative to class-specific optimal thresholds, with the x-axis showing the label set. This illustrat…
Figure 2: CBR wrapper components, illustrated against a grey background. Given a query document, the backbone model, f, produces label scores and an embedding, which are used to retrieve similar cases from the casebase. Retrieved predictions and labels are then combined in an adaptation step to produce a locally adjusted prediction, followed by threshold calibration to obtain the final multi-label output.
Figure 3: Cross-dataset accuracy summary across all models. Bars show the RAPT win rate per dataset, while lines show mean improvement in Macro-F1 and Micro-F1 relative to the best static baseline. Labels above bars indicate the number of wins out of 25 model configurations (4 metric learners + 7 Transformers x 3 modes).
Original abstract

Industrial multi-label document understanding pipelines score candidate labels and threshold or rank them to form a label set per document. This early selection step directly affects the accuracy of downstream information extraction from the document, as well as the associated verification effort. In practice, OCR noise, label imbalance, instance-dependent label cardinality, and asymmetric error costs make global score thresholds brittle and hard to maintain as document formats evolve. We present RAPT, a deployment-oriented retrieval-augmented score thresholding wrapper, applied post-hoc to improve label set selection without retraining the underlying classifier. RAPT is a model-agnostic wrapper: any predictor that provides document representations for similarity search and per-label confidence scores can be used, including metric learning encoders and fine-tuned transformer classifiers. For each query document, given a classifier's score vector, RAPT retrieves similar document thresholding situations (cases) and adapts the query's label set selection threshold using their outcomes. The adaptation selects the final label set by locally aggregating neighbour solutions (e.g. average label count, cutoff calibration). Evaluation compared multi-label classifiers (metric learners and transformers) combined with RAPT against global and label-wise thresholding baselines, and against few-shot LLMs. Across an industrial dataset and six public benchmarks, RAPT consistently outperformed global and label-wise static thresholding baselines. In the industrial setting, RAPT achieved its best predictive performance with metric learners, reaching 0.87 Macro-F1, while fine-tuned transformer variants on average achieved 0.775 Macro-F1, outperforming few-shot LLM baselines (K = 5) by 2x and requiring at least 115x less inference time and 13.5x less GPU memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RAPT, a model-agnostic retrieval-augmented wrapper for post-hoc thresholding in multi-label classification. Given a classifier's score vector and document representation, RAPT retrieves similar past documents in embedding space and adapts the query threshold by locally aggregating neighbor outcomes such as average label count or cutoff calibration. It is evaluated against global and label-wise static baselines plus few-shot LLMs on one industrial dataset and six public benchmarks, reporting consistent gains and a peak of 0.87 Macro-F1 with metric learners in the industrial setting.

Significance. If the central claim holds, the work offers practical value for industrial document pipelines by enabling instance-adaptive thresholding without retraining or heavy inference costs. The efficiency advantage over few-shot LLMs (115x less time, 13.5x less memory) is a clear strength, and the model-agnostic design broadens applicability to both metric learners and transformers.

major comments (2)
  1. [Experiments] The evaluation reports consistent outperformance but supplies no quantitative metrics on retrieval quality, neighbor relevance, aggregation rules, statistical significance, or ablation of the adaptation step. This information is required to verify that gains derive from the retrieval-augmented mechanism rather than other factors.
  2. [Approach] The adaptation step rests on the untested assumption that documents retrieved by similarity in the classifier's representation space share transferable thresholding situations. The motivating factors listed (OCR noise, instance-dependent cardinality, asymmetric costs) precisely suggest that embedding proximity need not correlate with optimal score cutoffs; no section or experiment demonstrates that the chosen similarity metric preserves thresholding-relevant structure.
minor comments (1)
  1. [Abstract] The abstract states 'K = 5' for few-shot LLM baselines without clarifying what K denotes or how the comparison was controlled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify areas where additional evidence can strengthen the claims regarding the retrieval-augmented mechanism. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Experiments] The evaluation reports consistent outperformance but supplies no quantitative metrics on retrieval quality, neighbor relevance, aggregation rules, statistical significance, or ablation of the adaptation step. This information is required to verify that gains derive from the retrieval-augmented mechanism rather than other factors.

    Authors: We agree that these supporting analyses are needed to isolate the contribution of the retrieval component. In the revised manuscript we will add: (i) quantitative retrieval-quality metrics including mean neighbor similarity and label-set overlap between query and neighbors; (ii) an ablation that disables local aggregation and compares performance to the full RAPT pipeline; (iii) statistical significance tests (paired Wilcoxon signed-rank) on the reported F1 improvements across the seven datasets; and (iv) explicit description and sensitivity results for the aggregation rules. These additions will be placed in a new subsection of the experimental evaluation. revision: yes

  2. Referee: [Approach] The adaptation step rests on the untested assumption that documents retrieved by similarity in the classifier's representation space share transferable thresholding situations. The motivating factors listed (OCR noise, instance-dependent cardinality, asymmetric costs) precisely suggest that embedding proximity need not correlate with optimal score cutoffs; no section or experiment demonstrates that the chosen similarity metric preserves thresholding-relevant structure.

    Authors: The concern is well-founded: the motivating factors could in principle decouple embedding proximity from threshold optimality. Our current defense rests on the empirical observation that RAPT yields consistent gains when the same representation space is used for both classification and retrieval, suggesting that the learned embeddings already encode label-relevant structure. Nevertheless, we have not provided a direct diagnostic. In the revision we will insert a new analysis that measures the correlation between neighbor cosine similarity and (a) label-cardinality difference and (b) the difference in per-instance optimal cutoffs derived from held-out validation. We will also discuss the extent to which the classifier's training objective encourages preservation of thresholding-relevant features. Any observed limitations will be acknowledged and listed as future work. revision: yes
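Two of the diagnostics promised in these responses can be sketched in a few lines. The rebuttal does not say how label-set overlap or the correlation would be computed, so Jaccard overlap and Pearson correlation are assumptions here, and all input values are invented for illustration.

```python
import math
from typing import List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Label-set overlap; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_neighbour_overlap(query_labels: Set[str], neighbours: List[Set[str]]) -> float:
    """Average overlap between the query's true labels and each neighbour's."""
    return sum(jaccard(query_labels, n) for n in neighbours) / len(neighbours)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; raises if either input is constant."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical query and neighbour label sets.
overlap = mean_neighbour_overlap(
    {"invoice", "signature"},
    [{"invoice", "signature"}, {"invoice"}, {"invoice", "table"}],
)

# Hypothetical (similarity, |label-count difference|) pairs for query-neighbour pairs.
similarities = [0.99, 0.95, 0.80, 0.60]
cardinality_diffs = [0, 0, 1, 2]
r = pearson(similarities, cardinality_diffs)
```

A strongly negative `r` on real data would support the premise that closer neighbours have more similar label counts; a value near zero would support the referee's concern that embedding proximity and thresholding behaviour can decouple.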

Circularity Check

0 steps flagged

No significant circularity detected in RAPT derivation

Full rationale

The paper introduces RAPT as a post-hoc, model-agnostic wrapper that retrieves similar documents in representation space and aggregates their thresholding outcomes (e.g., average label count or cutoff calibration) to adapt per-query thresholds. No equations, self-definitional loops, or fitted parameters are described that would make any claimed prediction or result equivalent to its inputs by construction. Performance evaluation is presented as empirical comparison against static baselines and few-shot LLMs on industrial and public datasets, with no reduction to self-citation chains or ansatzes smuggled from prior author work. The central inductive bias (embedding similarity correlating with optimal threshold transfer) is an explicit modeling assumption rather than a derived claim that collapses internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review limits visibility into exact parameters and assumptions; core premise is that neighbor outcomes transfer to the query.

axioms (1)
  • [domain assumption] Similarity in document representation space implies similarity in useful thresholding behavior
    Invoked by the retrieval step that selects neighbors for adaptation.
invented entities (1)
  • RAPT wrapper (no independent evidence)
    purpose: Post-hoc adaptation of label thresholds via neighbor aggregation
    New method introduced to solve the brittle global-threshold problem.

pith-pipeline@v0.9.0 · 5850 in / 1224 out tokens · 40697 ms · 2026-05-19T21:28:08.621601+00:00 · methodology

