MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events
Pith reviewed 2026-05-10 11:16 UTC · model grok-4.3
The pith
A living benchmark for multi-label classification of medical device adverse event reports reveals clear trade-offs between predictive accuracy and reliable uncertainty quantification across model types.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MADE is a living MLTC benchmark derived from medical device adverse event reports, continuously updated with newly published reports and evaluated under strict temporal splits to prevent contamination. Systematic baselines across encoder- and decoder-only models reveal clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty.
What carries the argument
The MADE benchmark, a living collection of hierarchical long-tailed multi-label medical device adverse event reports paired with a strict temporal split evaluation protocol.
If this is right
- Smaller discriminatively fine-tuned decoder models provide the best practical balance of head-to-tail accuracy and uncertainty quantification for this task.
- Generative fine-tuning should be preferred when reliable uncertainty estimates are the primary requirement.
- Large reasoning models can be used to improve coverage of rare labels but require additional uncertainty calibration techniques.
- Self-verbalized confidence cannot be trusted as a substitute for entropy- or consistency-based uncertainty methods.
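As a concrete anchor for the entropy-based methods contrasted with self-verbalized confidence above, the following is a minimal sketch of per-label predictive entropy for multi-label classification. It assumes independent per-label sigmoid probabilities and mean-entropy aggregation, which are illustrative choices; the paper's exact estimators may differ.

```python
# Minimal sketch: entropy-based uncertainty for multi-label classification,
# assuming a classifier that emits one sigmoid probability per label.
import numpy as np

def per_label_entropy(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Binary entropy (in nats) of each label's predicted probability.

    Entropy peaks at p = 0.5, where the model is least certain whether
    the label applies.
    """
    p = np.clip(probs, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def report_uncertainty(probs: np.ndarray) -> float:
    """Aggregate report-level uncertainty as the mean per-label entropy."""
    return float(per_label_entropy(probs).mean())

# A confident label (0.98) contributes little; an ambiguous one (0.5)
# contributes the maximum, log(2) ≈ 0.693 nats.
print(report_uncertainty(np.array([0.98, 0.5, 0.03])))
```

Self-verbalized confidence, by contrast, is whatever number the model states about itself, with no analytic link to its predictive distribution, which is one reading of why it proxies uncertainty poorly here.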
Where Pith is reading between the lines
- Healthcare deployment choices may need to trade accuracy for trustworthy uncertainty depending on whether the priority is overall correctness or safe human oversight of low-confidence cases.
- The living update mechanism allows future testing of whether models can handle genuinely novel adverse event patterns that did not exist in earlier data.
- The observed weakness in uncertainty quantification for large reasoning models suggests that scaling alone does not solve calibration issues in long-tailed multi-label medical settings.
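Calibration in this setting is commonly scored with expected calibration error over per-label predictions. The sketch below is a standard binned ECE, offered as an assumption about the metric family rather than the paper's exact variant.

```python
# Minimal sketch: binned expected calibration error (ECE) over flattened
# per-label binary predictions. Bin count and binning scheme are illustrative.
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray,
                               n_bins: int = 10) -> float:
    """Weighted mean gap between confidence and empirical frequency per bin.

    probs:  shape (n,) predicted probabilities that each label applies.
    labels: shape (n,) binary ground truth (1 if the label applies).
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Last bin is closed on the right so p = 1.0 is counted.
        mask = (probs >= lo) & ((probs < hi) if i < n_bins - 1 else (probs <= hi))
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return float(ece)

# Perfectly calibrated toy data: predictions of 0.8 are correct 80% of the time.
print(expected_calibration_error(np.full(10, 0.8),
                                 np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])))
```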
Load-bearing premise
Continuous addition of new reports together with strict temporal splits fully prevents training data contamination, and the hierarchical labels accurately reflect real-world dependencies and imbalances.
What would settle it
A model trained only on reports published before a given cutoff date that nonetheless achieves equal or higher accuracy and UQ calibration on reports published after that cutoff than on held-out pre-cutoff reports would indicate that the temporal split failed to block contamination, most plausibly through undisclosed pre-training exposure.
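For illustration, the strict temporal split that this test presupposes can be written as below. The report schema and cutoff dates are hypothetical; MADE's actual partitioning protocol is defined in the paper.

```python
# Minimal sketch: a strict temporal split keyed on publication date.
# Field names and cutoffs are hypothetical, not MADE's actual schema.
from datetime import date

reports = [
    {"id": 1, "published": date(2023, 5, 2), "text": "..."},
    {"id": 2, "published": date(2024, 3, 9), "text": "..."},
    {"id": 3, "published": date(2024, 9, 1), "text": "..."},
]

def temporal_split(reports, train_cutoff: date, val_cutoff: date):
    """Partition reports so every test report postdates every train/val report."""
    train = [r for r in reports if r["published"] < train_cutoff]
    val = [r for r in reports if train_cutoff <= r["published"] < val_cutoff]
    test = [r for r in reports if r["published"] >= val_cutoff]
    return train, val, test

# The split only blocks contamination if val_cutoff also postdates the
# pre-training cutoff of every model under evaluation -- the very condition
# the settling test above probes.
train, val, test = temporal_split(reports, date(2024, 1, 1), date(2024, 7, 1))
```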
Original abstract
Machine learning in high-stakes domains such as healthcare requires not only strong predictive performance but also reliable uncertainty quantification (UQ) to support human oversight. Multi-label text classification (MLTC) is a central task in this domain, yet remains challenging due to label imbalances, dependencies, and combinatorial complexity. Existing MLTC benchmarks are increasingly saturated and may be affected by training data contamination, making it difficult to distinguish genuine reasoning capabilities from memorization. We introduce MADE, a living MLTC benchmark derived from {m}edical device {ad}verse {e}vent reports and continuously updated with newly published reports to prevent contamination. MADE features a long-tailed distribution of hierarchical labels and enables reproducible evaluation with strict temporal splits. We establish baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings (instruction-tuned/reasoning variants, local/API-accessible). We systematically assess entropy-/consistency-based and self-verbalized UQ methods. Results show clear trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy while maintaining competitive UQ; generative fine-tuning delivers the most reliable UQ; large reasoning models improve performance on rare labels yet exhibit surprisingly weak UQ; and self-verbalized confidence is not a reliable proxy for uncertainty. Our work is publicly available at https://hhi.fraunhofer.de/aml-demonstrator/made-benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MADE, a living benchmark for multi-label text classification derived from medical device adverse event reports. It is continuously updated with new reports and uses strict temporal splits to prevent training data contamination. The work establishes baselines across more than 20 encoder- and decoder-only models under fine-tuning and few-shot settings, systematically evaluates entropy-/consistency-based and self-verbalized UQ methods, and reports empirical trade-offs: smaller discriminatively fine-tuned decoders achieve the strongest head-to-tail accuracy with competitive UQ; generative fine-tuning yields the most reliable UQ; large reasoning models improve on rare labels but show weak UQ; and self-verbalized confidence is not a reliable proxy.
Significance. If the results hold after addressing the verification gaps noted below, this benchmark would be a useful contribution to NLP for high-stakes healthcare applications, tackling the label imbalance, hierarchical dependencies, and contamination and saturation issues that affect existing MLTC datasets. The living design and public release enable ongoing reproducible evaluation, and the focus on UQ alongside accuracy is directly relevant to human oversight in medical domains. No machine-checked proofs or parameter-free derivations are present, but the empirical scope on real-world long-tailed data is a strength if properly documented.
major comments (2)
- Abstract: The abstract asserts specific performance trade-offs across models and UQ methods but supplies no methods details, statistical tests, error bars, or data characteristics (e.g., number of reports, label hierarchy depth, or imbalance ratios), leaving major gaps that prevent verification of the central claims about model superiority and UQ reliability.
- Abstract: The claim that strict temporal splits combined with continuous updating effectively prevents training data contamination is load-bearing for all reported results, yet no evidence is provided of model training data cutoffs versus test report dates or decontamination audits such as n-gram overlap or embedding similarity checks. This is especially critical for API and large reasoning models whose pre-training may include overlapping medical corpora.
minor comments (2)
- The abstract uses '{m}edical device {ad}verse {e}vent' which appears to be a formatting artifact for acronym expansion; clarify in the full text.
- Ensure the full manuscript defines all UQ methods (entropy, consistency, self-verbalized) with explicit formulas or pseudocode to support reproducibility.
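To make the reproducibility request concrete, here is one plausible form of a consistency-based UQ method: sample several label sets at nonzero temperature and score their disagreement. The `sample_label_set` callable is a hypothetical stand-in for the model call; the paper's actual estimator may differ.

```python
# Minimal sketch: consistency-based uncertainty for a generative multi-label
# classifier. `sample_label_set` is a hypothetical callable that returns one
# sampled set of label codes for a report.
from collections import Counter

def consistency_uncertainty(sample_label_set, report_text: str,
                            n_samples: int = 5) -> float:
    """1 minus the mean per-label agreement across sampled label sets.

    Returns 0.0 when every sample yields identical labels and approaches
    1.0 when the sampled label sets barely overlap.
    """
    samples = [frozenset(sample_label_set(report_text)) for _ in range(n_samples)]
    counts = Counter(label for s in samples for label in s)
    if not counts:
        return 0.0  # all samples agreed on the empty label set
    agreement = sum(c / n_samples for c in counts.values()) / len(counts)
    return 1.0 - agreement
```

A self-verbalized variant would instead parse a confidence number the model prints about its own answer; the abstract's finding is that such numbers are not a reliable proxy for uncertainty.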
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of clarity and verifiability in our presentation of the MADE benchmark. We address each major comment below and indicate the revisions planned for the manuscript.
Point-by-point responses
- Referee: Abstract: The abstract asserts specific performance trade-offs across models and UQ methods but supplies no methods details, statistical tests, error bars, or data characteristics (e.g., number of reports, label hierarchy depth, or imbalance ratios), leaving major gaps that prevent verification of the central claims about model superiority and UQ reliability.
Authors: We agree that the abstract prioritizes high-level claims over granular details. In the revised version, we will expand the abstract to include key data characteristics drawn from Section 2 of the manuscript, such as the scale of the report collection, the depth of the hierarchical label taxonomy, and the long-tailed imbalance ratios. We will also explicitly reference that all quantitative results in the paper include error bars computed over multiple runs and appropriate statistical tests for model comparisons. Full methodological specifications for the models, fine-tuning procedures, and UQ methods remain in Sections 3 and 4 to preserve abstract readability. These additions will directly support verification of the reported trade-offs. revision: yes
- Referee: Abstract: The claim that strict temporal splits combined with continuous updating effectively prevents training data contamination is load-bearing for all reported results, yet no evidence is provided of model training data cutoffs versus test report dates or decontamination audits such as n-gram overlap or embedding similarity checks. This is especially critical for API and large reasoning models whose pre-training may include overlapping medical corpora.
Authors: We recognize the critical importance of substantiating the contamination-resistance claim. The manuscript details the temporal splitting protocol in Section 3.2, which assigns reports to train/validation/test partitions strictly according to their publication dates, ensuring test reports postdate any feasible training cutoff for the models we evaluate. For the open-source models we fine-tune, we have now added explicit documentation of these dates along with n-gram overlap and embedding similarity audits against known pre-training corpora. In the revision, we will incorporate these checks into a new subsection. For proprietary API models and large reasoning models, however, pre-training data and exact cutoffs are not disclosed by the providers, so exhaustive audits are not feasible on our side. We will add a clear limitations statement acknowledging this constraint while noting that the temporal split still provides a practical and reproducible safeguard for the benchmark's ongoing use. revision: partial
- Not addressed: complete decontamination audits for closed-source API and large reasoning models, as their pre-training data and cutoffs are not publicly available.
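The rebuttal above commits to n-gram overlap audits for the open-source models. A minimal sketch of such an audit follows, using word-level 8-grams and no threshold tuning; the paper's actual tokenization, choice of n, and decision rule are not specified here and would need to match its new subsection.

```python
# Minimal sketch: n-gram overlap audit between test reports and a candidate
# pre-training corpus. Word-level 8-grams are an illustrative choice.
def ngrams(text: str, n: int = 8) -> set[str]:
    """Set of lowercased word-level n-grams of the text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(test_report: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the report's n-grams that also occur in the corpus.

    Ratios near 1.0 flag likely verbatim contamination; ratios near 0.0
    suggest the report is unseen at this granularity.
    """
    report_grams = ngrams(test_report, n)
    if not report_grams:
        return 0.0  # report shorter than n words
    corpus_grams = set().union(*(ngrams(d, n) for d in corpus_docs))
    return len(report_grams & corpus_grams) / len(report_grams)
```

Embedding-similarity audits would replace exact n-gram matching with nearest-neighbour search over dense report embeddings, catching paraphrased rather than verbatim leakage.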
Circularity Check
Empirical benchmark paper with no derivations or self-referential steps
Full rationale
The paper introduces the MADE benchmark dataset from medical device adverse event reports and reports empirical results from evaluating over 20 models under fine-tuning and few-shot settings using standard accuracy and uncertainty quantification metrics. No equations, derivations, fitted parameters, or predictive claims appear in the abstract or the described full text; all reported trade-offs (e.g., smaller decoders on head-to-tail accuracy, generative fine-tuning on UQ) are direct outputs of external model evaluations on temporally split data rather than reductions to internal definitions or self-citations. The living benchmark design and strict temporal splits are explicit construction choices for contamination resistance, not derived quantities. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling are invoked. The work's claims therefore rest on external benchmarks and model evaluations rather than on self-referential constructions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Medical device adverse event reports provide a representative corpus with hierarchical long-tailed labels suitable for MLTC and UQ evaluation.