From Documents to Spans: Scalable Supervision for Evidence-Based ICD Coding with LLMs
Pith reviewed 2026-05-15 10:20 UTC · model grok-4.3
The pith
Span supervision lets LLMs learn evidence patterns for ICD codes from short segments and transfer them to full clinical documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Span-Centric Learning strengthens an LLM's ability to recognize evidence spans, aggregate them, and assign codes, using a small set of annotated documents plus many lightweight spans. That capability then transfers to full clinical documents, where the model generates evidence-grounded predictions.
What carries the argument
Span-Centric Learning (SCL), a framework that supervises evidence recognition and code assignment at the span level before scaling to documents.
If this is right
- Under the Llama3.1-8B backbone the method raises macro-F1 by 8.2 points over standard supervised fine-tuning.
- Training cost drops to 20 percent of the cost of standard SFT while still producing evidence-linked outputs.
- Each predicted code is accompanied by explicit supporting text spans that humans can audit and revise.
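To make the auditing point concrete, here is a minimal sketch of what an evidence-linked prediction could look like and how a reviewer's tool might check it. The `EvidenceSpan` and `CodePrediction` types, the character-offset convention, and the toy note are all assumptions for illustration; the paper's actual output format is not described on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvidenceSpan:
    start: int  # inclusive character offset into the document (assumed convention)
    end: int    # exclusive character offset
    text: str   # evidence text as quoted by the model


@dataclass(frozen=True)
class CodePrediction:
    code: str
    evidence: Tuple[EvidenceSpan, ...]


def unsupported_spans(document: str, prediction: CodePrediction) -> List[EvidenceSpan]:
    """Return spans whose quoted text does not match the document at the stated offsets."""
    return [s for s in prediction.evidence if document[s.start:s.end] != s.text]


def is_auditable(document: str, prediction: CodePrediction) -> bool:
    """A prediction is auditable if it cites at least one span and every span checks out."""
    return bool(prediction.evidence) and not unsupported_spans(document, prediction)


# Toy note and illustrative code assignments (not taken from the paper).
doc = "Patient has type 2 diabetes mellitus. No evidence of pneumonia."
quote = "type 2 diabetes mellitus"
start = doc.index(quote)

grounded = CodePrediction("E11.9", (EvidenceSpan(start, start + len(quote), quote),))
misquoted = CodePrediction("J18.9", (EvidenceSpan(0, 9, "pneumonia"),))
no_evidence = CodePrediction("I10", ())
```

A check like this only confirms that the quoted evidence exists where the model says it does. Whether that evidence clinically justifies the code is still a human judgment.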
Where Pith is reading between the lines
- The approach may reduce overfitting to document-wide patterns and encourage focus on clinically local cues.
- Synthesizing additional spans from existing data could scale the method further with minimal new human labeling.
- Analogous span-centric supervision might extend to other evidence-requiring document tasks such as legal or regulatory text classification.
Load-bearing premise
Evidence patterns learned from short local spans transfer reliably to locating and combining evidence across entire clinical documents.
What would settle it
Measure whether span-trained models lose their accuracy advantage on documents whose supporting evidence for a code is split across distant sections rather than appearing in one localized region.
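One way such a test might be set up, assuming gold evidence spans are available for the evaluation documents, is to bucket each gold code by how far apart its supporting spans lie and compare how often the model recovers the code in each bucket. The extent measure, the `threshold` value, and the toy data below are hypothetical choices, not something the paper specifies.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of one gold evidence span


def evidence_extent(spans: Sequence[Span]) -> int:
    """Distance from the earliest span start to the latest span end."""
    return max(end for _, end in spans) - min(start for start, _ in spans)


def recall_by_dispersion(
    examples: List[Tuple[Sequence[Span], bool]], threshold: int
) -> Dict[str, float]:
    """Fraction of gold codes recovered, split by how spread out their evidence is.

    Each example pairs the gold evidence spans for one code with a flag saying
    whether the model predicted that code.
    """
    hits = {"localized": 0, "distributed": 0}
    totals = {"localized": 0, "distributed": 0}
    for spans, recovered in examples:
        bucket = "localized" if evidence_extent(spans) <= threshold else "distributed"
        totals[bucket] += 1
        hits[bucket] += int(recovered)
    return {b: hits[b] / totals[b] for b in totals if totals[b] > 0}


# Toy data: two codes with nearby evidence, two with evidence in distant sections.
toy_examples = [
    ([(10, 40)], True),
    ([(5, 30), (50, 80)], True),
    ([(0, 20), (4000, 4030)], False),
    ([(100, 130), (9000, 9020)], True),
]
rates = recall_by_dispersion(toy_examples, threshold=500)
```

Running the same split for the span-trained model and the SFT baseline would show whether the reported advantage shrinks in the "distributed" bucket.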
Original abstract
International Classification of Diseases (ICD) coding assigns diagnosis codes to clinical documents and is essential for healthcare billing and clinical analysis. Reliable coding requires that each predicted code be supported by explicit textual evidence. However, existing public datasets provide only code labels, without evidence annotations, limiting models' ability to learn evidence-grounded predictions. In this work, we argue that dense, document-level evidence annotation is not always necessary for learning evidence-based coding. Instead, models can learn code-specific evidence patterns from local spans and use these patterns to support document-level evidence-based coding. Based on this insight, we propose Span-Centric Learning (SCL), a training framework that strengthens LLMs' coding ability at the span level and transfers this capability to full clinical documents. Specifically, we use a small set of annotated documents to supervise evidence recognition, aggregation, and code assignment, while leveraging a large collection of lightweight evidence spans to reinforce span-level reasoning. Due to their compactness, span annotations are scalable and can be further augmented through synthesis. Under the same Llama3.1-8B backbone, our approach achieves an 8.2-point improvement in macro-F1 at only 20% of the training cost of standard SFT, and provides explicit supporting evidence for each predicted code, enabling human auditing and revision.
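Since the headline number is a macro-F1 gain, a short sketch of the metric may help. Macro-F1 averages per-code F1 scores, so a rare code counts as much as a frequent one. The function below is a plain-Python illustration under one common convention (averaging over every code that appears in the gold or predicted sets); the paper's exact evaluation protocol is not given here.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Unweighted mean of per-code F1 over all codes seen in gold or predictions."""
    codes = sorted(set().union(*gold, *pred))
    if not codes:
        return 0.0
    scores = []
    for code in codes:
        tp = sum(1 for g, p in zip(gold, pred) if code in g and code in p)
        fp = sum(1 for g, p in zip(gold, pred) if code not in g and code in p)
        fn = sum(1 for g, p in zip(gold, pred) if code in g and code not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: code "A" is frequent and always right, code "B" is rare and always wrong.
gold = [{"A", "B"}, {"A"}, {"A"}]
pred = [{"A"}, {"A"}, {"A", "B"}]
score = macro_f1(gold, pred)  # A scores 1.0, B scores 0.0, so the mean is 0.5
```

In this toy case the frequent code is predicted perfectly, yet the score is only 0.5 because the rare code is missed. This is why macro-F1 is sensitive to performance on infrequent ICD codes.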
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Span-Centric Learning (SCL), a training framework for LLMs in evidence-based ICD coding. It posits that models can learn code-specific evidence patterns from local spans using a small set of annotated documents for supervision of evidence recognition, aggregation, and code assignment, combined with a large collection of lightweight evidence spans that can be augmented synthetically. The key empirical claim is that, using the Llama3.1-8B backbone, SCL achieves an 8.2-point macro-F1 improvement over standard SFT while using only 20% of the training cost, and generates explicit supporting evidence for each code prediction to enable auditing.
Significance. Should the empirical results prove robust, this work would represent a meaningful advance in scalable, explainable clinical coding. The reduction in annotation burden through span-based supervision and the provision of evidence spans could facilitate wider adoption of LLMs in healthcare billing and analysis, where both accuracy and interpretability are critical. The approach addresses a practical bottleneck in existing datasets that lack evidence annotations.
major comments (3)
- [Abstract] The central performance claim of an 8.2-point macro-F1 gain at 20% training cost (relative to standard SFT on the same Llama3.1-8B backbone) is presented without reference to any table, figure, baseline details, data splits, or error bars. These details are load-bearing for verifying the result and attributing the gain to the proposed method.
- [§4 (Experiments)] No isolating ablation is reported that compares full SCL against a control using only the small annotated-document set (without span reinforcement or synthesis) on the identical Llama3.1-8B backbone; this separation is required to substantiate the span-to-document transfer assumption and the efficiency claim.
- [§3 (Method)] The premise that local span patterns aggregate correctly to handle document-global phenomena (negation scope, temporal relations, differential diagnosis) is stated but not tested with targeted counter-examples or error analysis on cases requiring cross-sentence context.
minor comments (2)
- [Abstract] The acronym 'Span-Centric Learning (SCL)' is introduced without a one-sentence definition, which would aid readers before the method section.
- Throughout: notation for evidence spans versus full documents is not always distinguished typographically, occasionally making it unclear whether a quantity refers to span-level or document-level supervision.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential of Span-Centric Learning to reduce annotation burden while improving explainability in ICD coding. We address each major comment below and outline targeted revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The central performance claim of an 8.2-point macro-F1 gain at 20% training cost (relative to standard SFT on the same Llama3.1-8B backbone) is presented without reference to any table, figure, baseline details, data splits, or error bars. These details are load-bearing for verifying the result and attributing the gain to the proposed method.
  Authors: We agree that the abstract should enable direct verification of the key empirical claim. In the revised version we will explicitly reference the relevant table and figure (currently Table 2 and Figure 3) that report the macro-F1 scores, training FLOPs, data splits, and standard deviations across three random seeds for the Llama3.1-8B backbone. This change will make the 8.2-point improvement and the reduction to 20% of the training cost traceable without altering the abstract's length or focus.
  revision: yes
- Referee: [§4 (Experiments)] No isolating ablation is reported that compares full SCL against a control using only the small annotated-document set (without span reinforcement or synthesis) on the identical Llama3.1-8B backbone; this separation is required to substantiate the span-to-document transfer assumption and the efficiency claim.
  Authors: We acknowledge that an isolating ablation is necessary to separate the contribution of span-level reinforcement from the small annotated-document supervision alone. We will add this control experiment to §4.3 in the revision, training the identical Llama3.1-8B model on the small annotated set only and comparing it directly to full SCL on the same test split, reporting macro-F1, training cost, and evidence-span quality metrics. This will clarify the incremental benefit of the lightweight span data.
  revision: yes
- Referee: [§3 (Method)] The premise that local span patterns aggregate correctly to handle document-global phenomena (negation scope, temporal relations, differential diagnosis) is stated but not tested with targeted counter-examples or error analysis on cases requiring cross-sentence context.
  Authors: We agree that explicit validation of cross-sentence aggregation is important. While the current results show strong overall performance, we will add a dedicated error-analysis subsection in the revised §4 that samples 100 documents containing negation, temporal, or differential-diagnosis phenomena. For each case we will report whether the model correctly aggregates the relevant spans and provide qualitative examples of both successful and failed aggregations. This analysis will be included without requiring new data collection.
  revision: partial
Circularity Check
No significant circularity; empirical result only
Full rationale
The paper advances an empirical training framework (Span-Centric Learning) that combines a small set of document-level annotations with a larger set of span annotations to improve evidence-based ICD coding on Llama3.1-8B. The 8.2-point macro-F1 gain and evidence-provision capability are reported as measured experimental outcomes at reduced training cost, not as quantities derived by definition or by fitting a parameter that is then renamed a prediction. No equations, self-citations, uniqueness theorems, or ansatzes appear in the derivation chain that would make the claimed transfer from spans to documents tautological. The span-to-document transfer is presented as a testable insight rather than a self-referential premise.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Local span-level evidence patterns can be learned from limited annotations and transferred to support accurate document-level code assignment.
invented entities (1)
- Span-Centric Learning (SCL): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "models can learn code-specific evidence patterns from local spans and use these patterns to support document-level evidence-based coding"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  unclear: relation between the paper passage and the cited Recognition theorem.
  Paper passage: "span-level learning improves LLMs' ability to perform document-level ICD coding"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.