Complete Evidence Extraction with Model Ensembles: A Case Study on Medical Coding

Katharina Beckh; Stefan Rüping; Sven Heuser

arxiv: 2511.07055 · v3 · submitted 2025-11-10 · 💻 cs.CL · cs.IR · cs.LG

Pith reviewed 2026-05-17 23:36 UTC · model grok-4.3

keywords complete evidence extraction · Rashomon ensembles · medical coding · feature attribution · model ensembles · language models · explainable AI

The pith

Aggregating token evidence from multiple language models recovers more complete supporting information for medical coding decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines complete evidence extraction as the task of identifying every input token that supports a decision, which is required for regulatory and billing purposes in medicine. It tests whether ensembles drawn from the Rashomon set of equally accurate models can combine their feature attributions to capture evidence that any one model misses. Results show that such ensembles raise recall of human-annotated evidence while adding only a modest number of extra tokens. Even ensembles of three models already surpass the strongest single model.

Core claim

Rashomon ensembles formed by aggregating token-level feature attributions across several language models that perform equally well on medical coding increase evidence recall substantially compared with any individual model, while the added token count stays small; ensembles of only three models already recover information missed by the best single model when measured against human gold-standard annotations.

What carries the argument

Rashomon ensembles that combine token attributions from multiple high-performing language models to assemble a fuller set of evidence tokens.

If this is right

Evidence recall rises significantly while token overhead stays small.
Ensembles of only three models already beat the best single model.
Information missed by any one model is recovered through aggregation.
The method supplies the fuller evidence sets needed for regulatory compliance in medical billing.
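
The recall and overhead claims above can be made concrete with a small sketch. It assumes each model's evidence is a set of token indices and that the ensemble is a plain union; the paper's actual aggregation rule is not given in this summary. The function names, the toy data, and the definition of overhead as extra tokens relative to the best single model are all hypothetical.

```python
from typing import Dict, Set


def union_evidence(per_model: Dict[str, Set[int]]) -> Set[int]:
    """Merge the token indices flagged as evidence by every model."""
    merged: Set[int] = set()
    for tokens in per_model.values():
        merged |= tokens
    return merged


def recall(predicted: Set[int], gold: Set[int]) -> float:
    """Fraction of human-annotated evidence tokens that were recovered."""
    return len(predicted & gold) / len(gold) if gold else 0.0


# Hypothetical toy data: token indices, not taken from the paper.
gold = {2, 3, 7, 8, 12}
per_model = {
    "model_a": {2, 3, 5},
    "model_b": {3, 7, 8},
    "model_c": {2, 12, 14},
}

ensemble = union_evidence(per_model)
best_single = max(per_model.values(), key=lambda tokens: recall(tokens, gold))

# Assumed definition of overhead: extra tokens versus the best single model.
overhead = len(ensemble) - len(best_single)

print("best single recall:", recall(best_single, gold))
print("ensemble recall:", recall(ensemble, gold))
print("token overhead:", overhead)
```

In this toy case the union recovers gold tokens that no single model covers alone, at the cost of a few extra tokens, which is the trade-off the claims describe.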

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same aggregation idea could be tested in other regulated domains that demand exhaustive rather than minimal evidence, such as legal contract review.
Selecting ensemble members by maximizing diversity in their attribution maps might further reduce the number of models needed.
Pairing the ensemble output with a lightweight human verification step could turn the added tokens into reliable audit trails.
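
The diversity-based selection idea above could be prototyped along these lines. This is a hypothetical greedy heuristic, not something the paper proposes: it measures diversity as Jaccard distance between the evidence token sets of two models, and it seeds the selection with the model that flags the most tokens, which is an arbitrary choice.

```python
from typing import Dict, List, Set


def jaccard_distance(a: Set[int], b: Set[int]) -> float:
    """1 minus the overlap ratio of two evidence sets (0 = identical)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def greedy_diverse_subset(per_model: Dict[str, Set[int]], k: int) -> List[str]:
    """Pick up to k models, each time adding the one farthest from those chosen."""
    names = sorted(per_model)
    # Assumption: seed with the model that flags the most tokens.
    chosen = [max(names, key=lambda n: len(per_model[n]))]
    while len(chosen) < min(k, len(names)):
        remaining = [n for n in names if n not in chosen]
        # Max-min criterion: maximize the distance to the closest chosen model.
        nxt = max(
            remaining,
            key=lambda n: min(
                jaccard_distance(per_model[n], per_model[c]) for c in chosen
            ),
        )
        chosen.append(nxt)
    return chosen


# Hypothetical toy evidence sets (token indices).
per_model = {
    "model_a": {1, 2, 3, 4},
    "model_b": {1, 2, 3},
    "model_c": {7, 8},
    "model_d": {4, 9},
}
print(greedy_diverse_subset(per_model, 3))
```

Whether such a selection would actually reduce the number of models needed is an open question that would have to be checked against the human annotations.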

Load-bearing premise

The feature attributions produced by each model accurately mark the tokens that truly support the code according to human judgment, and merging them does not add many irrelevant tokens.

What would settle it

A human re-annotation study on the extra tokens returned by the ensemble finds that most are judged irrelevant, or a larger test shows no recall gain beyond the best single model.

Figures

Figures reproduced from arXiv: 2511.07055 by Katharina Beckh, Stefan Rüping, Sven Heuser.

**Figure 2.** Recall for different ensemble sizes including all possible model c…
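
An evaluation of the kind the caption describes, recall per ensemble size over every model combination, could be computed roughly as below. This is a sketch under assumptions: evidence is represented as sets of token indices, ensembles are formed by union, and the model names and data are invented for illustration.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Set


def recall(predicted: Set[int], gold: Set[int]) -> float:
    """Fraction of gold evidence tokens present in the prediction."""
    return len(predicted & gold) / len(gold) if gold else 0.0


def recall_by_ensemble_size(
    per_model: Dict[str, Set[int]], gold: Set[int]
) -> Dict[int, List[float]]:
    """Map ensemble size k to the recall of every k-model combination."""
    names = sorted(per_model)
    results: Dict[int, List[float]] = {}
    for k in range(1, len(names) + 1):
        results[k] = [
            recall(set().union(*(per_model[n] for n in combo)), gold)
            for combo in combinations(names, k)
        ]
    return results


# Hypothetical toy data (token indices), not the paper's models or dataset.
gold = {1, 2, 3, 4}
per_model = {
    "model_a": {1},
    "model_b": {2},
    "model_c": {1, 3},
    "model_d": {4, 9},
}

results = recall_by_ensemble_size(per_model, gold)
for k, values in results.items():
    # Columns: size, number of combinations, min, mean, and max recall.
    print(k, len(values), round(min(values), 2), round(mean(values), 2), round(max(values), 2))
```

Reporting the spread per size, rather than only the best combination, shows whether a gain depends on picking particular models.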

Original abstract

High-stakes decisions informed by decision support systems require explicit evidence. While prior work focuses on short sufficient evidence, regulatory compliance and medical billing call for complete evidence: all relevant input tokens that support a decision. We formulate complete evidence extraction as a task and study it in a medical coding setting. Motivated by the Rashomon effect, we aggregate token-level evidence from multiple language models to increase evidence completeness. We perform a case study using existing equally-performing models, feature attributions, and a dataset with human-annotated evidence. Our results show that Rashomon ensembles significantly increase evidence recall while incurring only a small token overhead over individual models. Ensembles of only three models already outperform the best single model and recover information that individual models miss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines complete evidence extraction as a distinct task and shows small Rashomon ensembles lift recall over single models on human-annotated medical coding data with modest extra tokens.

The main thing here is that they treat complete evidence extraction as its own problem, separate from the usual short sufficient explanations, and then test whether Rashomon-style ensembles of models can pull out more of the human-labeled evidence in medical coding. The case study finds that ensembles of three models already beat the best single model on recall while adding only a small number of extra tokens. That result lines up with a real need in regulated settings where missing supporting tokens can matter for compliance and trust in decision support systems.

They keep the setup simple by using existing models, standard feature attributions, and a dataset with human annotations, which lets them measure directly against external ground truth rather than model self-reports. The empirical comparison is the part that lands cleanly: the ensembles recover tokens that individual models miss, and the overhead stays limited. This feels like a practical demonstration rather than a heavy theoretical lift.

The softer area is the aggregation step. The abstract does not spell out the exact combination rule for the attributions or report precision on the added tokens against the human labels. If the method surfaces low-confidence tokens that humans did not mark as evidence, some of the recall gain could reflect noisier coverage instead of genuinely better evidence. The stress-test note flags this exact risk, and without details on statistical tests or controls for false positives it remains the least secured part of the claim. Even so, the work avoids circularity because the gains are checked against human annotations rather than fitted parameters.

Readers working on explainable AI for clinical or regulated domains would get the most from the case study. It is targeted enough that someone focused on ensembles for better coverage or on evidence requirements in medicine would find it worth reading.

The paper deserves peer review because the task is new and the results are concrete enough to evaluate in detail. Referees could usefully ask for the precise aggregation method and any precision numbers on the recovered tokens.

Referee Report

2 major / 2 minor

Summary. The paper formulates complete evidence extraction as a task requiring all relevant input tokens that support a decision, in contrast to minimal sufficient evidence. In a medical coding case study, it aggregates token-level feature attributions from Rashomon ensembles of off-the-shelf language models to improve recall over single models while incurring only modest token overhead. Ensembles of three models are reported to outperform the best individual model by recovering information missed by any one model, evaluated against human-annotated ground truth.

Significance. If the empirical results hold under more rigorous controls, the work offers a practical, training-free method to increase evidence completeness in high-stakes domains such as medical billing and regulatory compliance. The use of existing equally-performing models, human-annotated data, and direct comparison to ground truth is a clear strength that grounds the Rashomon-motivated aggregation in falsifiable outcomes rather than theoretical claims alone.

major comments (2)

[§4.2 and Table 2] The reported recall lift (e.g., from best single-model ~0.65 to ensemble ~0.82) is not accompanied by precision or F1 on the additional tokens recovered relative to the human gold standard. Without these metrics the claim that the small token overhead reflects genuine complementary evidence rather than unvalidated false positives cannot be assessed.
[§3.2] The aggregation operator (union, thresholded sum, or other) is described at a high level but lacks explicit specification of cross-model token alignment, attribution normalization, or handling of differing model vocabularies. This detail is load-bearing for reproducing the “small overhead” result and for confirming that false-positive inflation is controlled.

minor comments (2)

[Abstract] The phrase “small token overhead” is not quantified (e.g., average added tokens or percentage increase); adding a concrete figure would improve immediate readability.
[§5] The limitations section could explicitly discuss the reliability assumptions of the chosen feature attribution methods (e.g., Integrated Gradients or attention) across the ensemble members.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight opportunities to strengthen the empirical claims and reproducibility of the work. We address each major point below and will revise the manuscript to incorporate additional metrics and explicit methodological details.

Referee: [§4.2 and Table 2] The reported recall lift (e.g., from best single-model ~0.65 to ensemble ~0.82) is not accompanied by precision or F1 on the additional tokens recovered relative to the human gold standard. Without these metrics the claim that the small token overhead reflects genuine complementary evidence rather than unvalidated false positives cannot be assessed.

Authors: We agree this is a valid gap. The current evaluation emphasizes recall for completeness and reports aggregate token overhead as an indirect control on precision, but does not isolate precision/F1 specifically on the incremental tokens added by the ensemble. In the revision we will add these metrics to Table 2 (and §4.2) by computing precision of the ensemble-only tokens against the human annotations, excluding tokens already recovered by the best single model. This will directly quantify whether the added evidence consists of true positives or false positives.
Revision: yes

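
The incremental precision described in this response could be computed along the following lines. This is a minimal sketch, assuming evidence is stored as sets of token indices; the function name and the toy data are hypothetical and do not come from the paper.

```python
from typing import Optional, Set


def incremental_precision(
    ensemble: Set[int], best_single: Set[int], gold: Set[int]
) -> Optional[float]:
    """Precision restricted to tokens the ensemble adds beyond the best single model.

    Returns None when the ensemble adds no tokens, since precision is undefined.
    """
    added = ensemble - best_single
    if not added:
        return None
    return len(added & gold) / len(added)


# Hypothetical toy data (token indices).
gold = {2, 3, 7, 8, 12}
best_single = {3, 7, 8}
ensemble = {2, 3, 5, 7, 8, 12, 14}

# Added tokens are {2, 5, 12, 14}; the gold annotations contain 2 and 12.
print(incremental_precision(ensemble, best_single, gold))
```

A value near 1 would indicate that the extra tokens are mostly human-confirmed evidence, while a low value would point to false-positive inflation.
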
Referee: [§3.2] The aggregation operator (union, thresholded sum, or other) is described at a high level but lacks explicit specification of cross-model token alignment, attribution normalization, or handling of differing model vocabularies. This detail is load-bearing for reproducing the “small overhead” result and for confirming that false-positive inflation is controlled.

Authors: We will expand §3.2 with the missing implementation details. Token alignment is performed at the word level via character offsets after detokenization; attributions are min-max normalized independently per model to [0,1] before aggregation; the operator is a thresholded sum (threshold 0.3) followed by union. For differing vocabularies we map subword attributions to word level by averaging. Pseudocode and a worked example on a short input will be added to ensure exact reproducibility of the reported overhead.
Revision: yes
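
The steps listed in this simulated response can be sketched as follows. This is one possible reading, not the authors' implementation. It assumes subword attributions arrive with character offsets, and it interprets "thresholded sum followed by union" as keeping every word whose summed normalized score across models reaches the threshold. All function names and the toy input are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def subwords_to_words(
    words: List[Span], subwords: List[Tuple[Span, float]]
) -> List[float]:
    """Average the attributions of subwords that fall inside each word span."""
    scores = []
    for w_start, w_end in words:
        inside = [a for (s, e), a in subwords if s >= w_start and e <= w_end]
        scores.append(sum(inside) / len(inside) if inside else 0.0)
    return scores


def min_max(scores: List[float]) -> List[float]:
    """Scale one model's scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def aggregate(
    words: List[Span],
    per_model_subwords: Dict[str, List[Tuple[Span, float]]],
    threshold: float = 0.3,
) -> Set[int]:
    """Sum per-model normalized word scores and keep words at or above threshold."""
    total = [0.0] * len(words)
    for subwords in per_model_subwords.values():
        normalized = min_max(subwords_to_words(words, subwords))
        total = [t + n for t, n in zip(total, normalized)]
    return {i for i, t in enumerate(total) if t >= threshold}


# Hypothetical input: "acute renal failure noted" with word-level offsets.
words = [(0, 5), (6, 11), (12, 19), (20, 25)]
model_a = [((0, 5), 0.1), ((6, 11), 0.9), ((12, 19), 0.8), ((20, 25), 0.1)]
# model_b splits "acute" into two subwords, so its scores are averaged per word.
model_b = [((0, 2), 0.6), ((2, 5), 0.4), ((6, 11), 0.0), ((12, 19), 0.2), ((20, 25), 0.0)]

selected = aggregate(words, {"model_a": model_a, "model_b": model_b})
print(sorted(selected))
```

Because each model is normalized independently, a word that is the top-scoring word for any one model always clears a threshold of 0.3, so the rule behaves like a union of per-model selections plus words that only clear the threshold jointly.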

Circularity Check

0 steps flagged

No circularity: empirical case study with external ground truth

The paper conducts an empirical case study on complete evidence extraction for medical coding. It aggregates token attributions from existing off-the-shelf models motivated by the Rashomon effect and directly compares recall and token overhead against a human-annotated dataset. No equations, fitted parameters, or derivations are presented that reduce the reported gains to self-referential definitions or inputs by construction. Claims rest on measurable performance differences versus external gold-standard annotations rather than any self-citation chain or ansatz smuggling. This is a standard non-circular empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that feature attributions faithfully indicate token relevance and that simple aggregation across models yields a more complete set without excessive noise.

axioms (1)

Domain assumption: Feature attributions from language models accurately reflect the contribution of each token to the model's coding decision.
The method uses these attributions to extract evidence; if they are noisy or misaligned with human judgment, the ensemble gains may not hold.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: We present a case study using existing language models and a medical dataset which contains human-annotated complete evidence. Our findings show that an ensemble approach, aggregating evidence from several models, improves evidence recall over individual models.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: Rashomon ensembles significantly increase evidence recall while incurring only a small token overhead

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Chenhao Tan. On the diversity and limits of human explanations. In NAACL, 2022

[2] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. In EMNLP, 2016

[3] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In NAACL, 2018

[4] Ronny Luss and Amit Dhurandhar. When stability meets sufficiency: Informative explanations that do not overwhelm. TMLR, 2024

[5] Katharina Beckh, Joann Rachel Jacob, Adrian Seeliger, Stefan Rüping, and Najmeh Mousavi Nejad. Limitations of feature attribution in long text classification of standards. In Proceedings of the AAAI Symposium Series, volume 4, 2024

[6] Anna Schmitz, Rebekka Görge, Elena Haedecke, Marion Borowski, Adrian Seeliger, and Maximilian Poretschkin. Towards formalising AI readiness of standards. In Digital Governance: Confronting the Challenges Posed by Artificial Intelligence. Springer, 2024

[7] Samuel Noll, Sarah Haag, Rémi Guidon, and Simon Hölzer. A new case-mix based payment system for the psychiatric day care sector in Switzerland: proposed methods for developing the tariff structure. Health Policy, 131, 2023

[8] Hua Cheng, Rana Jafari, April Russell, Russell Klopfer, Edmond Lu, Benjamin Striner, and Matthew Gormley. MDACE: MIMIC documents annotated with code evidence. In ACL, 2023

[9] Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. ERASER: A benchmark to evaluate rationalized NLP models. In ACL, 2020

[10] Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3), 2001

[11] Cynthia Rudin, Chudi Zhong, Lesia Semenova, Margo I. Seltzer, Ronald Parr, Jiachang Liu, Srikar Katta, Jon Donnelly, Harry Chen, and Zachery Boner. Amazing things come from having many good models. In ICML, 2024

[12] Sebastian Müller, Vanessa Toborek, Katharina Beckh, Matthias Jakobs, Christian Bauckhage, and Pascal Welke. An empirical evaluation of the Rashomon effect in explainable machine learning. In ECML. Springer, 2023

[13] Joakim Edin, Maria Maistro, Lars Maaløe, Lasse Borgholt, Jakob Drachmann Havtorn, and Tuukka Ruotsalo. An unsupervised approach to achieve supervised-level explainability in healthcare records. In EMNLP, 2024

[14] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation, 101(23), 2000

[15] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1), 2016

[16] Katharina Beckh, Elisa Studeny, Sujan Sai Gannamaneni, Dario Antweiler, and Stefan Rueping. The anatomy of evidence: An investigation into explainable ICD coding. In ACL Findings, 2025

[17] Supriya Khadka, Xiaorui Jiang, and Vasile Palade. Data quality in clinical coding: A critical analysis and preliminary study. medRxiv, 2025

[18] Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Improving adversarial robustness via promoting ensemble diversity. In ICML, volume 97, 2019
