pith. sign in

arxiv: 2511.07055 · v3 · submitted 2025-11-10 · 💻 cs.CL · cs.IR· cs.LG

Complete Evidence Extraction with Model Ensembles: A Case Study on Medical Coding

Pith reviewed 2026-05-17 23:36 UTC · model grok-4.3

classification 💻 cs.CL cs.IRcs.LG
keywords complete evidence extractionRashomon ensemblesmedical codingfeature attributionmodel ensembleslanguage modelsexplainable AI
0
0 comments X

The pith

Aggregating token evidence from multiple language models recovers more complete supporting information for medical coding decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines complete evidence extraction as the task of identifying every input token that supports a decision, which is required for regulatory and billing purposes in medicine. It tests whether ensembles drawn from the Rashomon set of equally accurate models can combine their feature attributions to capture evidence that any one model misses. Results show that such ensembles raise recall of human-annotated evidence while adding only a modest number of extra tokens. Even ensembles of three models already surpass the strongest single model.

Core claim

Rashomon ensembles formed by aggregating token-level feature attributions across several language models that perform equally well on medical coding increase evidence recall substantially compared with any individual model, while the added token count stays small; ensembles of only three models already recover information missed by the best single model when measured against human gold-standard annotations.

What carries the argument

Rashomon ensembles that combine token attributions from multiple high-performing language models to assemble a fuller set of evidence tokens.

If this is right

  • Evidence recall rises significantly while token overhead stays small.
  • Ensembles of only three models already beat the best single model.
  • Information missed by any one model is recovered through aggregation.
  • The method supplies the fuller evidence sets needed for regulatory compliance in medical billing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same aggregation idea could be tested in other regulated domains that demand exhaustive rather than minimal evidence, such as legal contract review.
  • Selecting ensemble members by maximizing diversity in their attribution maps might further reduce the number of models needed.
  • Pairing the ensemble output with a lightweight human verification step could turn the added tokens into reliable audit trails.

Load-bearing premise

The feature attributions produced by each model accurately mark the tokens that truly support the code according to human judgment, and merging them does not add many irrelevant tokens.

What would settle it

A human re-annotation study on the extra tokens returned by the ensemble finds that most are judged irrelevant, or a larger test shows no recall gain beyond the best single model.

Figures

Figures reproduced from arXiv: 2511.07055 by Katharina Beckh, Stefan R\"uping, Sven Heuser.

Figure 1
Figure 1. Figure 1: Illustration of complete evidence extraction on the medical c [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Recall for different ensemble sizes including all possible model c [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
read the original abstract

High-stakes decisions informed by decision support systems require explicit evidence. While prior work focuses on short sufficient evidence, regulatory compliance and medical billing call for complete evidence: all relevant input tokens that support a decision. We formulate complete evidence extraction as a task and study it in a medical coding setting. Motivated by the Rashomon effect, we aggregate token-level evidence from multiple language models to increase evidence completeness. We perform a case study using existing equally-performing models, feature attributions, and a dataset with human-annotated evidence. Our results show that Rashomon ensembles significantly increase evidence recall while incurring only a small token overhead over individual models. Ensembles of only three models already outperform the best single model and recover information that individual models miss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formulates complete evidence extraction as a task requiring all relevant input tokens that support a decision, in contrast to minimal sufficient evidence. In a medical coding case study, it aggregates token-level feature attributions from Rashomon ensembles of off-the-shelf language models to improve recall over single models while incurring only modest token overhead. Ensembles of three models are reported to outperform the best individual model by recovering information missed by any one model, evaluated against human-annotated ground truth.

Significance. If the empirical results hold under more rigorous controls, the work offers a practical, training-free method to increase evidence completeness in high-stakes domains such as medical billing and regulatory compliance. The use of existing equally-performing models, human-annotated data, and direct comparison to ground truth is a clear strength that grounds the Rashomon-motivated aggregation in falsifiable outcomes rather than theoretical claims alone.

major comments (2)
  1. [§4.2 and Table 2] §4.2 and Table 2: the reported recall lift (e.g., from best single-model ~0.65 to ensemble ~0.82) is not accompanied by precision or F1 on the additional tokens recovered relative to the human gold standard. Without these metrics the claim that the small token overhead reflects genuine complementary evidence rather than unvalidated false positives cannot be assessed.
  2. [§3.2] §3.2: the aggregation operator (union, thresholded sum, or other) is described at a high level but lacks explicit specification of cross-model token alignment, attribution normalization, or handling of differing model vocabularies. This detail is load-bearing for reproducing the “small overhead” result and for confirming that false-positive inflation is controlled.
minor comments (2)
  1. [Abstract] Abstract: the phrase “small token overhead” is not quantified (e.g., average added tokens or percentage increase); adding a concrete figure would improve immediate readability.
  2. [§5] §5: the limitations section could explicitly discuss the reliability assumptions of the chosen feature attribution methods (e.g., Integrated Gradients or attention) across the ensemble members.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight opportunities to strengthen the empirical claims and reproducibility of the work. We address each major point below and will revise the manuscript to incorporate additional metrics and explicit methodological details.

read point-by-point responses
  1. Referee: [§4.2 and Table 2] §4.2 and Table 2: the reported recall lift (e.g., from best single-model ~0.65 to ensemble ~0.82) is not accompanied by precision or F1 on the additional tokens recovered relative to the human gold standard. Without these metrics the claim that the small token overhead reflects genuine complementary evidence rather than unvalidated false positives cannot be assessed.

    Authors: We agree this is a valid gap. The current evaluation emphasizes recall for completeness and reports aggregate token overhead as an indirect control on precision, but does not isolate precision/F1 specifically on the incremental tokens added by the ensemble. In the revision we will add these metrics to Table 2 (and §4.2) by computing precision of the ensemble-only tokens against the human annotations, excluding tokens already recovered by the best single model. This will directly quantify whether the added evidence consists of true positives or false positives. revision: yes

  2. Referee: [§3.2] §3.2: the aggregation operator (union, thresholded sum, or other) is described at a high level but lacks explicit specification of cross-model token alignment, attribution normalization, or handling of differing model vocabularies. This detail is load-bearing for reproducing the “small overhead” result and for confirming that false-positive inflation is controlled.

    Authors: We will expand §3.2 with the missing implementation details. Token alignment is performed at the word level via character offsets after detokenization; attributions are min-max normalized independently per model to [0,1] before aggregation; the operator is a thresholded sum (threshold 0.3) followed by union. For differing vocabularies we map subword attributions to word level by averaging. Pseudocode and a worked example on a short input will be added to ensure exact reproducibility of the reported overhead. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical case study with external ground truth

full rationale

The paper conducts an empirical case study on complete evidence extraction for medical coding. It aggregates token attributions from existing off-the-shelf models motivated by the Rashomon effect and directly compares recall and token overhead against a human-annotated dataset. No equations, fitted parameters, or derivations are presented that reduce the reported gains to self-referential definitions or inputs by construction. Claims rest on measurable performance differences versus external gold-standard annotations rather than any self-citation chain or ansatz smuggling. This is a standard non-circular empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that feature attributions faithfully indicate token relevance and that simple aggregation across models yields a more complete set without excessive noise.

axioms (1)
  • domain assumption Feature attributions from language models accurately reflect the contribution of each token to the model's coding decision.
    The method uses these attributions to extract evidence; if they are noisy or misaligned with human judgment, the ensemble gains may not hold.

pith-pipeline@v0.9.0 · 5425 in / 1154 out tokens · 30517 ms · 2026-05-17T23:36:18.151340+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

  1. [1]

    On the diversity and limits of human explana tions

    Chenhao Tan. On the diversity and limits of human explana tions. In NAACL, 2022

  2. [2]

    Rationali zing neural predictions

    Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationali zing neural predictions. In EMNLP, 2016

  3. [3]

    FEVER: a large-scale dataset for fact extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulop oulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In NAACL, 2018

  4. [4]

    When stability meets suffi ciency: Informative expla- nations that do not overwhelm

    Ronny Luss and Amit Dhurandhar. When stability meets suffi ciency: Informative expla- nations that do not overwhelm. TMLR, 2024

  5. [5]

    Limitations of feature attribution in l ong text classification of standards

    Katharina Beckh, Joann Rachel Jacob, Adrian Seeliger, S tefan R¨ uping, and Na- jmeh Mousavi Nejad. Limitations of feature attribution in l ong text classification of standards. In Proceedings of the AAAI Symposium Series , volume 4, 2024

  6. [6]

    Towards formalising AI readiness of standards

    Anna Schmitz, Rebekka G¨ orge, Elena Haedecke, Marion Bo rowski, Adrian Seeliger, and Maximilian Poretschkin. Towards formalising AI readiness of standards. In Digital Gov- ernance: Confronting the Challenges Posed by Artificial Int elligence. Springer, 2024

  7. [7]

    A new case-mix based payment system for the psychiatric day care sector in switze rland: proposed methods for developing the tariff structure

    Samuel Noll, Sarah Haag, R´ emi Guidon, and Simon H¨ olzer . A new case-mix based payment system for the psychiatric day care sector in switze rland: proposed methods for developing the tariff structure. Health Policy , 131, 2023

  8. [8]

    MDACE: MIMIC documents annotated with code evidence

    Hua Cheng, Rana Jafari, April Russell, Russell Klopfer, Edmond Lu, Benjamin Striner, and Matthew Gormley. MDACE: MIMIC documents annotated with code evidence. In ACL, 2023

  9. [9]

    W allace

    Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric L ehman, Caiming Xiong, Richard Socher, and Byron C. W allace. ERASER: A benchmark to evaluate rationalized NLP models. In ACL, 2020

  10. [10]

    Statistical modeling: The two cultures

    Leo Breiman. Statistical modeling: The two cultures. Statistical science, 16(3), 2001

  11. [11]

    Se ltzer, Ronald Parr, Jiachang Liu, Srikar Katta, Jon Donnelly, Harry Chen, and Zachery Bon er

    Cynthia Rudin, Chudi Zhong, Lesia Semenova, Margo I. Se ltzer, Ronald Parr, Jiachang Liu, Srikar Katta, Jon Donnelly, Harry Chen, and Zachery Bon er. Amazing things come from having many good models. In ICML, 2024

  12. [12]

    An empirical evaluation of the rasho mon effect in explainable machine learning

    Sebastian M¨ uller, Vanessa Toborek, Katharina Beckh, Matthias Jakobs, Christian Bauck- hage, and Pascal W elke. An empirical evaluation of the rasho mon effect in explainable machine learning. In ECML. Springer, 2023

  13. [13]

    An unsupervised approach to achieve su pervised-level explainabil- ity in healthcare records

    Joakim Edin, Maria Maistro, Lars Maaløe, Lasse Borghol t, Jakob Drachmann Havtorn, and Tuukka Ruotsalo. An unsupervised approach to achieve su pervised-level explainabil- ity in healthcare records. In EMNLP, 2024

  14. [14]

    PhysioBank, PhysioToolkit, and PhysioNet: compo nents of a new research re- source for complex physiologic signals

    Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Ha usdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Pen g, and H Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: compo nents of a new research re- source for complex physiologic signals. Circulation, 101(23), 2000

  15. [15]

    MIMIC-III, a freely accessible critical care databas e

    Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Leh man, Mengling Feng, Mo- hammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anth ony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care databas e. Scientific Data , 3(1), 2016

  16. [16]

    The anatomy of evidence: An investigation into exp lainable ICD coding

    Katharina Beckh, Elisa Studeny, Sujan Sai Gannamaneni , Dario Antweiler, and Stefan Rueping. The anatomy of evidence: An investigation into exp lainable ICD coding. In ACL Findings , 2025

  17. [17]

    Data quality in clinical coding: A critical analysis and preliminary study

    Supriya Khadka, Xiaorui Jiang, and Vasile Palade. Data quality in clinical coding: A critical analysis and preliminary study. medRxiv, 2025

  18. [18]

    Impr oving adversarial ro- bustness via promoting ensemble diversity

    Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Impr oving adversarial ro- bustness via promoting ensemble diversity. In ICML, volume 97, 2019