RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification

Darren Nicol; Ikechukwu Nkisi-Orji; Lasal Jayawardena; Nirmalie Wiratunga

arxiv: 2605.16535 · v1 · pith:HHWZIQSQnew · submitted 2026-05-15 · 💻 cs.IR · cs.AI

Pith reviewed 2026-05-19 21:28 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords multi-label classification · post-hoc thresholding · retrieval augmentation · document classification · label set selection · industrial pipelines · metric learning

The pith

RAPT adapts label selection thresholds by retrieving similar past documents and aggregating their outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Industrial multi-label document classification relies on scoring candidate labels and then choosing a threshold to form each document's label set. Fixed global or label-wise thresholds often fail under OCR noise, label imbalance, varying label counts per document, and changing formats. RAPT solves this by treating past documents as cases: for a new query it retrieves similar documents in the classifier's representation space and adapts the threshold using the outcomes observed in those cases. A sympathetic reader cares because the choice of labels directly determines the accuracy of all downstream extraction and the amount of manual verification required. The method works as a model-agnostic post-hoc wrapper on any predictor that supplies both representations and per-label scores.
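The two static baselines mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the label names, scores, and cutoff values are invented for the example, and the helper names `global_threshold` and `labelwise_threshold` are hypothetical.

```python
from typing import Dict, List

Scores = Dict[str, float]  # label -> classifier confidence (assumed to lie in [0, 1])


def global_threshold(scores: Scores, tau: float) -> List[str]:
    """Keep every label whose score reaches one shared cutoff."""
    return sorted(label for label, s in scores.items() if s >= tau)


def labelwise_threshold(scores: Scores, taus: Dict[str, float],
                        default: float = 0.5) -> List[str]:
    """Keep a label when its score reaches that label's own cutoff."""
    return sorted(label for label, s in scores.items()
                  if s >= taus.get(label, default))


# Hypothetical scores for one noisy scan: every score is depressed.
noisy = {"invoice": 0.46, "address": 0.41, "signature": 0.08}
taus = {"invoice": 0.45, "address": 0.45, "signature": 0.30}

print(global_threshold(noisy, tau=0.5))   # -> []
print(labelwise_threshold(noisy, taus))   # -> ['invoice']
```

Both rules are fixed before the document arrives, so a document whose scores are uniformly shifted (for example by OCR noise) loses labels that a per-document cutoff might have kept.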

Core claim

RAPT is a deployment-oriented retrieval-augmented score thresholding wrapper. For each query document, given a classifier's score vector, it retrieves similar document thresholding situations and adapts the query's label set selection threshold by locally aggregating neighbour solutions such as average label count or cutoff calibration. This post-hoc adaptation improves label set selection without retraining the underlying classifier.
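One way to read the "average label count" adaptation is sketched below. It is an assumption-laden illustration rather than the authors' code: cosine similarity, the round-half-up rule, the toy casebase, and the function names `retrieve` and `adapt_by_label_count` are all choices made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Case = Tuple[Sequence[float], int]  # (embedding, number of verified labels)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_emb: Sequence[float], casebase: List[Case], k: int) -> List[Case]:
    """Return the k cases most similar to the query (assumed metric: cosine)."""
    return sorted(casebase, key=lambda case: cosine(query_emb, case[0]),
                  reverse=True)[:k]


def adapt_by_label_count(scores: Dict[str, float], neighbours: List[Case]) -> List[str]:
    """Keep the top-m labels, where m is the neighbours' mean label count."""
    if not neighbours:
        return []
    mean_count = sum(count for _, count in neighbours) / len(neighbours)
    m = int(mean_count + 0.5)  # round half up; an assumption of this sketch
    ranked = sorted(scores, key=lambda label: scores[label], reverse=True)
    return sorted(ranked[:m])


# Toy casebase: two "two-label" documents and two "one-label" documents.
casebase = [([1.0, 0.0], 2), ([0.9, 0.1], 2), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
query_emb = [1.0, 0.05]
query_scores = {"invoice": 0.46, "address": 0.41, "signature": 0.08}

neighbours = retrieve(query_emb, casebase, k=2)
print(adapt_by_label_count(query_scores, neighbours))  # -> ['address', 'invoice']
```

The classifier is never retrained here: only the number of labels kept changes, driven by what was observed for nearby documents.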

What carries the argument

Retrieval of similar document thresholding situations (cases) from the classifier's representation space, followed by local aggregation of their outcomes to adapt the current threshold.
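The "cutoff calibration" form of local aggregation could look roughly like this. The summary does not specify how a case's cutoff is derived, so the midpoint rule below (halfway between the lowest-scoring true label and the highest-scoring false label) is purely an assumption of this sketch, as are the names `case_cutoff` and `adapt_cutoff` and the fallback value.

```python
from typing import Dict, List, Set, Tuple

Scores = Dict[str, float]              # label -> score, assumed to lie in [0, 1]
SolvedCase = Tuple[Scores, Set[str]]   # (scores the classifier gave, verified labels)


def case_cutoff(scores: Scores, true_labels: Set[str]) -> float:
    """Cutoff that would have separated this case's true and false labels."""
    pos = [s for label, s in scores.items() if label in true_labels]
    neg = [s for label, s in scores.items() if label not in true_labels]
    if not pos:
        return 1.0          # no true labels: place the cutoff at the top of the range
    if not neg:
        return min(pos)     # every label is true: keep them all
    return (min(pos) + max(neg)) / 2.0


def adapt_cutoff(query_scores: Scores, neighbours: List[SolvedCase],
                 fallback: float = 0.5) -> List[str]:
    """Threshold the query at the mean cutoff of its retrieved neighbours."""
    if neighbours:
        tau = sum(case_cutoff(s, y) for s, y in neighbours) / len(neighbours)
    else:
        tau = fallback      # behave like a static global threshold
    return sorted(label for label, s in query_scores.items() if s >= tau)


# Neighbours are assumed to have been retrieved already.
n1 = ({"invoice": 0.50, "address": 0.44, "signature": 0.10}, {"invoice", "address"})
n2 = ({"invoice": 0.48, "address": 0.40, "signature": 0.20}, {"invoice", "address"})
query = {"invoice": 0.46, "address": 0.41, "signature": 0.08}

print(adapt_cutoff(query, [n1, n2]))  # -> ['address', 'invoice']
print(adapt_cutoff(query, []))        # -> []
```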

If this is right

RAPT consistently outperforms global and label-wise static thresholding baselines on both public benchmarks and industrial data.
Best results occur when RAPT is paired with metric learners, reaching 0.87 Macro-F1 in the industrial setting.
Transformer-based models with RAPT average 0.775 Macro-F1 and outperform few-shot LLM baselines by a factor of two while using far less inference time and memory.
The wrapper can be applied to any model that outputs both document representations for similarity search and per-label confidence scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Representation spaces trained for classification appear to encode enough structure that nearest-neighbor documents also share useful thresholding behavior.
Larger case bases of past documents could make the adaptation more robust as document formats continue to evolve.
The same retrieval-plus-aggregation pattern may apply to other post-hoc calibration problems where a single global rule is insufficient.

Load-bearing premise

Documents retrieved by similarity on the classifier's representation space will have thresholding situations whose outcomes are relevant and transferable to the query document's optimal label set.

What would settle it

Performance of RAPT falls to or below the static baseline when retrieval is replaced by random selection of past documents or when the adaptation step is disabled.
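The control described above can be framed as a small harness that swaps the neighbour picker while holding everything else fixed. The sketch below uses toy data, a per-document F1, and a label-count adaptation; all of these, along with the names `nearest`, `random_pick`, `mean_f1`, and `static_f1`, are assumptions for illustration and not the paper's evaluation protocol.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Set, Tuple

Case = Tuple[Sequence[float], int]                          # (embedding, label count)
Query = Tuple[Sequence[float], Dict[str, float], Set[str]]  # (embedding, scores, gold)
Picker = Callable[[Sequence[float], List[Case], int, random.Random], List[Case]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(emb, casebase, k, rng):
    return sorted(casebase, key=lambda c: cosine(emb, c[0]), reverse=True)[:k]


def random_pick(emb, casebase, k, rng):
    return rng.sample(casebase, k)  # ignores the query: the control condition


def select(scores: Dict[str, float], neighbours: List[Case]) -> Set[str]:
    m = int(sum(n for _, n in neighbours) / len(neighbours) + 0.5)
    return set(sorted(scores, key=lambda l: scores[l], reverse=True)[:m])


def f1(pred: Set[str], gold: Set[str]) -> float:
    if not pred and not gold:
        return 1.0
    return 2 * len(pred & gold) / (len(pred) + len(gold))


def mean_f1(queries: List[Query], casebase: List[Case], picker: Picker,
            k: int = 2, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(f1(select(s, picker(e, casebase, k, rng)), g)
               for e, s, g in queries) / len(queries)


def static_f1(queries: List[Query], tau: float = 0.5) -> float:
    return sum(f1({l for l, s in sc.items() if s >= tau}, g)
               for _, sc, g in queries) / len(queries)


casebase = [([1.0, 0.0], 2), ([0.9, 0.1], 2), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
queries = [
    ([1.0, 0.05], {"invoice": 0.46, "address": 0.41, "signature": 0.08},
     {"invoice", "address"}),
    ([0.05, 1.0], {"invoice": 0.10, "address": 0.12, "signature": 0.90},
     {"signature"}),
]

print(mean_f1(queries, casebase, nearest))      # -> 1.0
print(static_f1(queries))                       # -> 0.5
print(mean_f1(queries, casebase, random_pick))  # depends on the seed
```

If the random-picker score were to match the nearest-neighbour score on real data, that would suggest the gains come from something other than retrieval.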

Figures

Figures reproduced from arXiv: 2605.16535 by Darren Nicol, Ikechukwu Nkisi-Orji, Lasal Jayawardena, Nirmalie Wiratunga.

**Figure 1.** Industry use case pipeline for document processing and information extraction. Left: a typical document with areas indicated for class label assignment. Top: multi-label predictions act as routing decisions for downstream extraction tasks. Bottom: typical prediction scores from a backbone model on the y-axis, shown relative to class-specific optimal thresholds, with the x-axis showing the label set. This illustrat…

**Figure 2.** CBR wrapper components, illustrated against a grey background. Given a query document, the backbone model, f, produces label scores and an embedding, which are used to retrieve similar cases from the casebase. Retrieved predictions and labels are then combined in an adaptation step to produce a locally adjusted prediction, followed by threshold calibration to obtain the final multi-label output.

**Figure 3.** Cross-dataset accuracy summary across all models. Bars show the RAPT win rate per dataset, while lines show mean improvement in Macro-F1 and Micro-F1 relative to the best static baseline. Labels above bars indicate the number of wins out of 25 model configurations (4 metric learners + 7 Transformers x 3 modes).

Original abstract

Industrial multi-label document understanding pipelines score candidate labels and threshold or rank them to form a label set per document. This early selection step directly affects the accuracy of downstream information extraction from the document, as well as the associated verification effort. In practice, OCR noise, label imbalance, instance-dependent label cardinality, and asymmetric error costs make global score thresholds brittle and hard to maintain as document formats evolve. We present RAPT, a deployment-oriented retrieval-augmented score thresholding wrapper, applied post-hoc to improve label set selection without retraining the underlying classifier. RAPT is a model-agnostic wrapper: any predictor that provides document representations for similarity search and per-label confidence scores can be used, including metric learning encoders and fine-tuned transformer classifiers. For each query document, given a classifier's score vector, RAPT retrieves similar document thresholding situations (cases) and adapts the query's label set selection threshold using their outcomes. The adaptation selects the final label set by locally aggregating neighbour solutions (e.g. average label count, cutoff calibration). Evaluation compared multi-label classifiers (metric learners and transformers) combined with RAPT against global and label-wise thresholding baselines, and against few-shot LLMs. Across an industrial dataset and six public benchmarks, RAPT consistently outperformed global and label-wise static thresholding baselines. In the industrial setting, RAPT achieved its best predictive performance with metric learners, reaching 0.87 Macro-F1, while fine-tuned transformer variants on average achieved 0.775 Macro-F1, outperforming few-shot LLM baselines (K = 5) by 2x and requiring at least 115x less inference time and 13.5x less GPU memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces RAPT, a model-agnostic retrieval-augmented wrapper for post-hoc thresholding in multi-label classification. Given a classifier's score vector and document representation, RAPT retrieves similar past documents in embedding space and adapts the query threshold by locally aggregating neighbor outcomes such as average label count or cutoff calibration. It is evaluated against global and label-wise static baselines plus few-shot LLMs on one industrial dataset and six public benchmarks, reporting consistent gains and a peak of 0.87 Macro-F1 with metric learners in the industrial setting.

Significance. If the central claim holds, the work offers practical value for industrial document pipelines by enabling instance-adaptive thresholding without retraining or heavy inference costs. The efficiency advantage over few-shot LLMs (115x less time, 13.5x less memory) is a clear strength, and the model-agnostic design broadens applicability to both metric learners and transformers.

major comments (2)

[Experiments] The evaluation reports consistent outperformance but supplies no quantitative metrics on retrieval quality, neighbor relevance, aggregation rules, statistical significance, or ablation of the adaptation step. This information is required to verify that gains derive from the retrieval-augmented mechanism rather than other factors.
[Approach] The adaptation step rests on the untested assumption that documents retrieved by similarity in the classifier's representation space share transferable thresholding situations. The motivating factors listed (OCR noise, instance-dependent cardinality, asymmetric costs) precisely suggest that embedding proximity need not correlate with optimal score cutoffs; no section or experiment demonstrates that the chosen similarity metric preserves thresholding-relevant structure.

minor comments (1)

[Abstract] The abstract states 'K = 5' for few-shot LLM baselines without clarifying what K denotes or how the comparison was controlled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify areas where additional evidence can strengthen the claims regarding the retrieval-augmented mechanism. We respond to each major comment below and indicate the revisions we will make.

Referee: [Experiments] The evaluation reports consistent outperformance but supplies no quantitative metrics on retrieval quality, neighbor relevance, aggregation rules, statistical significance, or ablation of the adaptation step. This information is required to verify that gains derive from the retrieval-augmented mechanism rather than other factors.

Authors: We agree that these supporting analyses are needed to isolate the contribution of the retrieval component. In the revised manuscript we will add: (i) quantitative retrieval-quality metrics including mean neighbor similarity and label-set overlap between query and neighbors; (ii) an ablation that disables local aggregation and compares performance to the full RAPT pipeline; (iii) statistical significance tests (paired Wilcoxon signed-rank) on the reported F1 improvements across the seven datasets; and (iv) explicit description and sensitivity results for the aggregation rules. These additions will be placed in a new subsection of the experimental evaluation. revision: yes

Referee: [Approach] The adaptation step rests on the untested assumption that documents retrieved by similarity in the classifier's representation space share transferable thresholding situations. The motivating factors listed (OCR noise, instance-dependent cardinality, asymmetric costs) precisely suggest that embedding proximity need not correlate with optimal score cutoffs; no section or experiment demonstrates that the chosen similarity metric preserves thresholding-relevant structure.

Authors: The concern is well-founded: the motivating factors could in principle decouple embedding proximity from threshold optimality. Our current defense rests on the empirical observation that RAPT yields consistent gains when the same representation space is used for both classification and retrieval, suggesting that the learned embeddings already encode label-relevant structure. Nevertheless, we have not provided a direct diagnostic. In the revision we will insert a new analysis that measures the correlation between neighbor cosine similarity and (a) label-cardinality difference and (b) the difference in per-instance optimal cutoffs derived from held-out validation. We will also discuss the extent to which the classifier's training objective encourages preservation of thresholding-relevant features. Any observed limitations will be acknowledged and listed as future work. revision: yes
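The diagnostic promised in this response, part (a), might be prototyped as below. This is a sketch under stated assumptions: it correlates cosine similarity with label-cardinality difference over all case pairs rather than only retrieved neighbours, uses Pearson correlation, and runs on invented toy data. The function name `similarity_vs_cardinality_gap` is hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Case = Tuple[Sequence[float], int]  # (embedding, number of verified labels)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def similarity_vs_cardinality_gap(cases: List[Case]) -> float:
    """Correlate pairwise similarity with the absolute label-count difference."""
    sims, gaps = [], []
    for (emb_a, n_a), (emb_b, n_b) in combinations(cases, 2):
        sims.append(cosine(emb_a, emb_b))
        gaps.append(abs(n_a - n_b))
    return pearson(sims, gaps)


cases = [([1.0, 0.0], 2), ([0.9, 0.1], 2), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
r = similarity_vs_cardinality_gap(cases)
print(f"correlation = {r:.3f}")  # strongly negative on this toy data
```

A clearly negative correlation would be consistent with the premise that nearby documents need similar label counts; a value near zero would support the referee's concern.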

Circularity Check

0 steps flagged

No significant circularity detected in RAPT derivation

The paper introduces RAPT as a post-hoc, model-agnostic wrapper that retrieves similar documents in representation space and aggregates their thresholding outcomes (e.g., average label count or cutoff calibration) to adapt per-query thresholds. No equations, self-definitional loops, or fitted parameters are described that would make any claimed prediction or result equivalent to its inputs by construction. Performance evaluation is presented as empirical comparison against static baselines and few-shot LLMs on industrial and public datasets, with no reduction to self-citation chains or ansatzes smuggled from prior author work. The central inductive bias (embedding similarity correlating with optimal threshold transfer) is an explicit modeling assumption rather than a derived claim that collapses internally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only review limits visibility into exact parameters and assumptions; core premise is that neighbor outcomes transfer to the query.

axioms (1)

domain assumption: Similarity in document representation space implies similarity in useful thresholding behavior.
Invoked by the retrieval step that selects neighbors for adaptation.

invented entities (1)

RAPT wrapper (no independent evidence)
purpose: Post-hoc adaptation of label thresholds via neighbor aggregation
New method introduced to solve the brittle global-threshold problem.

pith-pipeline@v0.9.0 · 5850 in / 1224 out tokens · 40697 ms · 2026-05-19T21:28:08.621601+00:00 · methodology

