RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification
Pith reviewed 2026-05-19 21:28 UTC · model grok-4.3
The pith
RAPT adapts label selection thresholds by retrieving similar past documents and aggregating their outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAPT is a deployment-oriented, retrieval-augmented score-thresholding wrapper. For each query document, given a classifier's score vector, it retrieves thresholding situations (cases) from similar documents and adapts the query's label-set selection threshold by locally aggregating the neighbours' solutions, such as average label count or cutoff calibration. This post-hoc adaptation improves label set selection without retraining the underlying classifier.
What carries the argument
Retrieval of thresholding situations (cases) from documents that are similar in the classifier's representation space, followed by local aggregation of their outcomes to adapt the current threshold.
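The retrieve-then-aggregate pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the case format `(embedding, label_count)`, the function names `cosine` and `rapt_label_count`, the choice of cosine similarity, and the value `k=3` are all hypothetical, and only the "average label count" aggregation is shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rapt_label_count(query_emb, query_scores, case_base, k=3):
    """Select the top-n labels, where n is the mean label count of the k nearest cases.

    case_base: non-empty list of (embedding, label_count) pairs (assumed format).
    """
    # Retrieval: rank stored cases by similarity to the query embedding.
    ranked = sorted(case_base, key=lambda c: cosine(query_emb, c[0]), reverse=True)
    neighbours = ranked[:k]
    # Aggregation: average the neighbours' label counts.
    mean_count = sum(count for _, count in neighbours) / len(neighbours)
    n = max(1, round(mean_count))
    # Adaptation: keep the n highest-scoring labels of the query.
    order = sorted(range(len(query_scores)), key=lambda i: query_scores[i], reverse=True)
    return sorted(order[:n])

# Toy case base, for illustration only.
case_base = [
    ([1.0, 0.0], 2),
    ([0.9, 0.1], 2),
    ([0.8, 0.3], 3),
    ([0.0, 1.0], 1),
]
selected = rapt_label_count([1.0, 0.05], [0.9, 0.2, 0.6, 0.55], case_base, k=3)
print(selected)  # [0, 2]
```

Here the three nearest cases carry 2, 2 and 3 labels, so the query keeps its two highest-scoring labels. The classifier itself is never touched, which is what makes the wrapper post-hoc.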
If this is right
- RAPT consistently outperforms global and label-wise static thresholding baselines on both public benchmarks and industrial data.
- Best results occur when RAPT is paired with metric learners, reaching 0.87 Macro-F1 in the industrial setting.
- In the industrial setting, fine-tuned transformer variants with RAPT average 0.775 Macro-F1, outperforming few-shot LLM baselines (K = 5) by 2x while requiring at least 115x less inference time and 13.5x less GPU memory.
- The wrapper can be applied to any model that outputs both document representations for similarity search and per-label confidence scores.
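For reference, the two static baselines named above can be written as follows. This is a sketch of the general techniques, assuming scores in [0, 1]; the function names and the example cutoffs are hypothetical and are not taken from the paper.

```python
def global_threshold(scores, t=0.5):
    """Select every label whose score reaches one shared cutoff t."""
    return [i for i, s in enumerate(scores) if s >= t]

def labelwise_threshold(scores, cutoffs):
    """Select label i when its score reaches that label's own fixed cutoff."""
    return [i for i, (s, t) in enumerate(zip(scores, cutoffs)) if s >= t]

# Toy scores and cutoffs, for illustration only.
scores = [0.9, 0.2, 0.6, 0.45]
print(global_threshold(scores))                           # [0, 2]
print(labelwise_threshold(scores, [0.5, 0.1, 0.7, 0.4]))  # [0, 1, 3]
```

Both rules are fixed once and applied to every document identically; neither reacts to the individual query, which is the gap instance-adaptive thresholding is meant to close.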
Where Pith is reading between the lines
- Representation spaces trained for classification appear to encode enough structure that nearest-neighbor documents also share useful thresholding behavior.
- Larger case bases of past documents could make the adaptation more robust as document formats continue to evolve.
- The same retrieval-plus-aggregation pattern may apply to other post-hoc calibration problems where a single global rule is insufficient.
Load-bearing premise
Documents retrieved by similarity on the classifier's representation space will have thresholding situations whose outcomes are relevant and transferable to the query document's optimal label set.
What would settle it
Performance of RAPT falls to or below the static baseline when retrieval is replaced by random selection of past documents or when the adaptation step is disabled.
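A control of this kind could be set up by swapping the neighbour-selection step while leaving everything else unchanged. The sketch below is an assumption about how such a test might be wired; `pick_neighbours`, its `mode` flag, and the toy values are hypothetical.

```python
import random

def pick_neighbours(similarities, k, mode="nearest", seed=0):
    """Return indices of k cases: the most similar ones, or a random control sample."""
    indices = list(range(len(similarities)))
    if mode == "random":
        # Control condition: ignore similarity entirely.
        return sorted(random.Random(seed).sample(indices, k))
    ranked = sorted(indices, key=lambda i: similarities[i], reverse=True)
    return sorted(ranked[:k])

def mean_label_count(label_counts, chosen):
    """Aggregate the label counts of the chosen cases."""
    return sum(label_counts[i] for i in chosen) / len(chosen)

# Toy values, for illustration only.
similarities = [0.1, 0.9, 0.8, 0.2, 0.7]
label_counts = [1, 3, 3, 1, 3]
nearest = pick_neighbours(similarities, k=3)
control = pick_neighbours(similarities, k=3, mode="random", seed=42)
print(mean_label_count(label_counts, nearest))
print(mean_label_count(label_counts, control))
```

If downstream Macro-F1 were the same under both modes, the gains could not be attributed to retrieval.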
Original abstract
Industrial multi-label document understanding pipelines score candidate labels and threshold or rank them to form a label set per document. This early selection step directly affects the accuracy of downstream information extraction from the document, as well as the associated verification effort. In practice, OCR noise, label imbalance, instance-dependent label cardinality, and asymmetric error costs make global score thresholds brittle and hard to maintain as document formats evolve. We present RAPT, a deployment-oriented retrieval-augmented score thresholding wrapper, applied post-hoc to improve label set selection without retraining the underlying classifier. RAPT is a model-agnostic wrapper: any predictor that provides document representations for similarity search and per-label confidence scores can be used, including metric learning encoders and fine-tuned transformer classifiers. For each query document, given a classifier's score vector, RAPT retrieves similar document thresholding situations (cases) and adapts the query's label set selection threshold using their outcomes. The adaptation selects the final label set by locally aggregating neighbour solutions (e.g. average label count, cutoff calibration). Evaluation compared multi-label classifiers (metric learners and transformers) combined with RAPT against global and label-wise thresholding baselines, and against few-shot LLMs. Across an industrial dataset and six public benchmarks, RAPT consistently outperformed global and label-wise static thresholding baselines. In the industrial setting, RAPT achieved its best predictive performance with metric learners, reaching 0.87 Macro-F1, while fine-tuned transformer variants on average achieved 0.775 Macro-F1, outperforming few-shot LLM baselines (K = 5) by 2x and requiring at least 115x less inference time and 13.5x less GPU memory.
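The abstract names cutoff calibration as a second aggregation rule but does not define it. One plausible reading, shown here purely as an assumption, is that each retrieved case stores the score cutoff that worked best for it, and the query's cutoff is a similarity-weighted mean of those values. The names `calibrated_cutoff` and `select_labels`, the `fallback` parameter, and the weighting scheme are hypothetical.

```python
def calibrated_cutoff(neighbours, fallback=0.5):
    """Similarity-weighted mean of the neighbours' stored cutoffs.

    neighbours: list of (similarity, cutoff) pairs with non-negative similarities
    (assumed format). Returns `fallback` when no neighbour carries any weight.
    """
    total = sum(sim for sim, _ in neighbours)
    if total <= 0:
        return fallback
    return sum(sim * cut for sim, cut in neighbours) / total

def select_labels(scores, cutoff):
    """Keep every label whose score reaches the adapted cutoff."""
    return [i for i, s in enumerate(scores) if s >= cutoff]

# Toy neighbours and scores, for illustration only.
cutoff = calibrated_cutoff([(0.9, 0.4), (0.6, 0.7), (0.5, 0.4)])
print(round(cutoff, 2))                               # 0.49
print(select_labels([0.9, 0.2, 0.6, 0.45], cutoff))   # [0, 2]
```

Unlike the label-count rule, this variant lets the number of selected labels vary with the query's own score distribution.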
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RAPT, a model-agnostic retrieval-augmented wrapper for post-hoc thresholding in multi-label classification. Given a classifier's score vector and document representation, RAPT retrieves similar past documents in embedding space and adapts the query threshold by locally aggregating neighbor outcomes such as average label count or cutoff calibration. It is evaluated against global and label-wise static baselines plus few-shot LLMs on one industrial dataset and six public benchmarks, reporting consistent gains and a peak of 0.87 Macro-F1 with metric learners in the industrial setting.
Significance. If the central claim holds, the work offers practical value for industrial document pipelines by enabling instance-adaptive thresholding without retraining or heavy inference costs. The efficiency advantage over few-shot LLMs (at least 115x less inference time and 13.5x less GPU memory) is a clear strength, and the model-agnostic design broadens applicability to both metric learners and transformers.
major comments (2)
- [Experiments] The evaluation reports consistent outperformance but supplies no quantitative metrics on retrieval quality, neighbor relevance, aggregation rules, statistical significance, or ablation of the adaptation step. This information is required to verify that gains derive from the retrieval-augmented mechanism rather than other factors.
- [Approach] The adaptation step rests on the untested assumption that documents retrieved by similarity in the classifier's representation space share transferable thresholding situations. The motivating factors listed (OCR noise, instance-dependent cardinality, asymmetric costs) precisely suggest that embedding proximity need not correlate with optimal score cutoffs; no section or experiment demonstrates that the chosen similarity metric preserves thresholding-relevant structure.
minor comments (1)
- [Abstract] The abstract states 'K = 5' for few-shot LLM baselines without clarifying what K denotes or how the comparison was controlled.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments identify areas where additional evidence can strengthen the claims regarding the retrieval-augmented mechanism. We respond to each major comment below and indicate the revisions we will make.
- Referee: [Experiments] The evaluation reports consistent outperformance but supplies no quantitative metrics on retrieval quality, neighbor relevance, aggregation rules, statistical significance, or ablation of the adaptation step. This information is required to verify that gains derive from the retrieval-augmented mechanism rather than other factors.
  Authors: We agree that these supporting analyses are needed to isolate the contribution of the retrieval component. In the revised manuscript we will add: (i) quantitative retrieval-quality metrics including mean neighbor similarity and label-set overlap between query and neighbors; (ii) an ablation that disables local aggregation and compares performance to the full RAPT pipeline; (iii) statistical significance tests (paired Wilcoxon signed-rank) on the reported F1 improvements across the seven datasets; and (iv) explicit description and sensitivity results for the aggregation rules. These additions will be placed in a new subsection of the experimental evaluation.
  revision: yes
- Referee: [Approach] The adaptation step rests on the untested assumption that documents retrieved by similarity in the classifier's representation space share transferable thresholding situations. The motivating factors listed (OCR noise, instance-dependent cardinality, asymmetric costs) precisely suggest that embedding proximity need not correlate with optimal score cutoffs; no section or experiment demonstrates that the chosen similarity metric preserves thresholding-relevant structure.
  Authors: The concern is well-founded: the motivating factors could in principle decouple embedding proximity from threshold optimality. Our current defense rests on the empirical observation that RAPT yields consistent gains when the same representation space is used for both classification and retrieval, suggesting that the learned embeddings already encode label-relevant structure. Nevertheless, we have not provided a direct diagnostic. In the revision we will insert a new analysis that measures the correlation between neighbor cosine similarity and (a) label-cardinality difference and (b) the difference in per-instance optimal cutoffs derived from held-out validation. We will also discuss the extent to which the classifier's training objective encourages preservation of thresholding-relevant features. Any observed limitations will be acknowledged and listed as future work.
  revision: yes
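Two of the diagnostics promised in the responses above are simple to state in code: label-set overlap between a query and a neighbor, and the correlation between neighbor similarity and label-cardinality difference. The sketch below assumes Jaccard overlap and Pearson correlation as the concrete measures; the rebuttal does not specify either, and the function names and toy values are hypothetical.

```python
import math

def jaccard(a, b):
    """Label-set overlap: |A intersect B| / |A union B| (1.0 when both sets are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Toy (query, neighbor) pairs, for illustration only: cosine similarity and
# absolute difference in label count for each pair.
similarity = [0.95, 0.80, 0.60, 0.30]
cardinality_diff = [0, 1, 1, 3]
print(jaccard([0, 2], [2, 3]))
print(pearson(similarity, cardinality_diff))
```

A clearly negative correlation would support the transfer assumption: the more similar the neighbor, the smaller the gap in label count. A value near zero would indicate that embedding proximity says little about thresholding behavior.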
Circularity Check
No significant circularity detected in RAPT derivation
The paper introduces RAPT as a post-hoc, model-agnostic wrapper that retrieves similar documents in representation space and aggregates their thresholding outcomes (e.g., average label count or cutoff calibration) to adapt per-query thresholds. No equations, self-definitional loops, or fitted parameters are described that would make any claimed prediction or result equivalent to its inputs by construction. Performance evaluation is presented as empirical comparison against static baselines and few-shot LLMs on industrial and public datasets, with no reduction to self-citation chains or ansatzes smuggled from prior author work. The central inductive bias (embedding similarity correlating with optimal threshold transfer) is an explicit modeling assumption rather than a derived claim that collapses internally.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Similarity in document representation space implies similarity in useful thresholding behavior
invented entities (1)
- RAPT wrapper: no independent evidence