
arxiv: 2605.17755 · v1 · pith:E777OQ26new · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes

Pith reviewed 2026-05-20 11:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords clinical coding · ICD-9 · ICD-10 · multi-version training · rare codes · label-wise attention · medical NLP · long-tail problem

The pith

Combining ICD-9 and ICD-10 training data raises micro F1 on rare ICD-10 codes by 27 percent without any code mapping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Clinical coding turns free-text notes into standardized ICD codes, but new versions arrive regularly and rare codes remain difficult to predict. The paper tests whether a single model can be trained on annotations from both ICD-9 and ICD-10 at once. Despite differences in code definitions and granularity, the combined data improves ICD-10 prediction. The gain is largest for the long tail of rare codes, and the same approach also lifts performance on frequent codes while using fewer parameters.

Core claim

A modified label-wise attention model trained on mixed ICD-9 and ICD-10 data outperforms the same model trained on ICD-10 alone when both are evaluated on ICD-10 prediction. On roughly 18,000 rare ICD-10 codes, micro F1 rises by 27 percent in relative terms; on 8,000 frequent codes, macro metrics also improve. These gains occur without explicit alignment between the two code sets and with a smaller total parameter count.
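
The claim leans on two different averages and on a relative (not absolute) gain, so a small worked sketch may help. The function names and the toy inputs below are illustrative assumptions, and the scores 0.200 and 0.254 are made-up placeholders chosen only to show what a 27 percent relative gain looks like; they are not figures from the paper.

```python
from typing import Dict, List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]], codes: Set[str]
) -> Tuple[float, float]:
    """Return (micro F1, macro F1) restricted to the given code subset."""
    counts: Dict[str, List[int]] = {c: [0, 0, 0] for c in codes}  # tp, fp, fn
    for g, p in zip(gold, pred):
        for c in codes:
            if c in p and c in g:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in g:
                counts[c][2] += 1
    # Micro: pool the counts over all codes, then compute one F1.
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    micro = f1(tp, fp, fn)
    # Macro: compute F1 per code, then take the unweighted mean.
    macro = sum(f1(*v) for v in counts.values()) / len(counts)
    return micro, macro


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


# Toy input: three notes, two codes.
gold = [{"A", "B"}, {"A"}, {"B"}]
pred = [{"A"}, {"A", "B"}, set()]
micro, macro = micro_macro_f1(gold, pred, {"A", "B"})  # micro = 4/7, macro = 0.5

# Hypothetical scores, NOT from the paper: 0.200 -> 0.254 is a 27% relative gain.
gain = relative_gain(0.200, 0.254)
```

Micro F1 is dominated by whichever codes contribute the most predictions inside the subset, while macro F1 weights every code equally, which is presumably why the paper reports the two on different code groups.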

What carries the argument

A label-wise attention model is trained on pooled ICD-9 and ICD-10 annotations and learns representations shared across versions, with no mapping step between the two code sets.
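
As a rough illustration of the mechanism, the sketch below implements a generic label-wise attention layer in plain Python, in the style of the LAAT model the paper builds on, and applies it to a pooled label space. It is not the paper's modified architecture. The pooling by simple concatenation, the version prefixes on the code names, the dimensions, and the random weights are all assumptions made for illustration. The split between a shared projection `W` and per-code rows in `U` follows the sharing strategy described in the simulated rebuttal further down this page, which the paper itself has not been checked against.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def label_wise_attention(
    H: Matrix, W: Matrix, U: Matrix, out_w: Matrix, out_b: List[float]
) -> Tuple[List[float], Matrix]:
    """H: n_tokens x d, W: d x d_attn (shared), U and out_w: one row per code."""
    Z = [[math.tanh(v) for v in row] for row in matmul(H, W)]  # n_tokens x d_attn
    logits, attn = [], []
    for u, w, b in zip(U, out_w, out_b):
        # One attention distribution over tokens for each code.
        scores = [sum(zi * ui for zi, ui in zip(z, u)) for z in Z]
        alpha = softmax(scores)
        # Code-specific document vector: attention-weighted sum of token states.
        v = [sum(wt * h[j] for wt, h in zip(alpha, H)) for j in range(len(H[0]))]
        logits.append(sum(vi * wi for vi, wi in zip(v, w)) + b)
        attn.append(alpha)
    return logits, attn


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Assumed pooling: concatenate both code sets, prefixed by version, no mapping.
icd9_codes = ["9:428.0", "9:250.00"]
icd10_codes = ["10:I50.9", "10:E11.9", "10:I10"]
labels = icd9_codes + icd10_codes

n_tokens, d, d_attn = 6, 4, 3
H = rand_matrix(n_tokens, d, rng)         # stand-in for encoder token states
W = rand_matrix(d, d_attn, rng)           # shared across both versions
U = rand_matrix(len(labels), d_attn, rng) # one query row per code
out_w = rand_matrix(len(labels), d, rng)
out_b = [0.0] * len(labels)

logits, attn = label_wise_attention(H, W, U, out_w, out_b)
icd10_logits = logits[len(icd9_codes):]   # at ICD-10 test time, read only these
```

Under this layout, ICD-9 notes update the encoder and the shared projection even though their label rows are never read at ICD-10 test time, which is one plausible route for the transfer the paper reports.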

If this is right

  • Historical ICD-9 data can be reused to improve current ICD-10 models without version-specific retraining.
  • The long-tail problem in clinical coding becomes less severe when older annotations supplement newer ones.
  • Fewer parameters suffice for strong coverage of both rare and common codes when multi-version data is used.
  • Models may generalize across future ICD releases if the same joint-training pattern continues to work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Semantic overlap between versions appears large enough that explicit cross-version mapping may often be unnecessary.
  • The same joint-training idea could be tested on other evolving medical terminologies such as SNOMED CT or procedure codes.
  • Regions or hospitals that still hold large ICD-9 archives could immediately improve their ICD-10 systems without new labeling campaigns.

Load-bearing premise

The attention model can extract useful shared signals from ICD-9 and ICD-10 labels even though the two versions differ in definition, granularity, and annotation habits.

What would settle it

If a model trained only on ICD-10 data matches or exceeds the combined model's micro F1 on the 18,000 rare ICD-10 codes, the claimed benefit of multi-version training would not hold.
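
One way to make "matches or exceeds" operational is a paired bootstrap over test notes. The sketch below is an assumption about how such a check could be run, not the paper's evaluation procedure; the function names, the resample count, and the seed are hypothetical.

```python
import random
from typing import List, Set, Tuple

Counts = Tuple[int, int, int]  # (tp, fp, fn) for one test note


def doc_counts(gold: Set[str], pred: Set[str], codes: Set[str]) -> Counts:
    """Per-note counts restricted to a code subset, e.g. the rare codes."""
    g, p = gold & codes, pred & codes
    return len(g & p), len(p - g), len(g - p)


def micro_f1(counts: List[Counts]) -> float:
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap(
    base: List[Counts], combined: List[Counts], n_resamples: int = 2000, seed: int = 0
) -> float:
    """Fraction of resamples in which the combined model does NOT beat the baseline."""
    rng = random.Random(seed)
    n = len(base)
    not_better = 0
    for _ in range(n_resamples):
        # Resample notes with replacement, using the same indices for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = micro_f1([combined[i] for i in idx]) - micro_f1([base[i] for i in idx])
        if delta <= 0:
            not_better += 1
    return not_better / n_resamples
```

A value near zero would support the multi-version claim on the chosen code subset; a value near one half or above would point the other way. The input lists hold one `doc_counts` result per test note, computed for the ICD-10-only model and for the combined model on the same notes in the same order.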

Figures

Figures reproduced from arXiv: 2605.17755 by Anthony Nguyen, Jinghui Liu.

Figure 1: (a) ICD coding faces two intertwined challenges: the ICD system evolves continuously and the code distribution is heavily long-tailed. (b) We mix three MIMIC-derived datasets spanning ICD-9 and ICD-10 to train a single version-agnostic model, DUAL-LAAT. (c) Adding ICD-9 to ICD-10 training yields a 27% relative gain in micro F1 on rare ICD-10 codes over training on ICD-10 alone. Current best-performing ICD…
Original abstract

Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper examines training a modified label-wise attention model for ICD-10 code prediction by augmenting ICD-10 training data with ICD-9 annotations. It claims that, despite version differences in code definitions and granularity, this multi-version approach yields a 27% micro-F1 improvement on 18K rare ICD-10 codes relative to ICD-10-only training, plus macro-metric gains on 8K frequent codes, all while using substantially fewer parameters.

Significance. If the reported gains are attributable to cross-version representation sharing in the label-wise attention layers rather than simply to increased training volume, the result would be practically relevant for clinical coding systems that must accommodate ICD version transitions and long-tail code distributions. The work directly targets a known bottleneck in deployable medical NLP models.

major comments (3)
  1. [Abstract] Abstract: the headline 27% micro-F1 lift on 18K rare codes is presented without any baseline model specification, statistical significance test, or explicit frequency threshold used to define the rare-code set; these omissions leave the central empirical claim only weakly supported.
  2. [Experiments] Experiments (assumed §4): no size-matched ablation is reported that adds an equivalent number of additional ICD-10-only examples to the training set; without this control, it is impossible to separate the benefit of multi-version training from the simple effect of larger training corpus size.
  3. [Methods] Methods (assumed §3): the description of the label-wise attention architecture does not specify whether ICD-9 and ICD-10 code embeddings or attention parameters are shared or kept separate, which is load-bearing for the claim that the model learns transferable cross-version representations without explicit mapping.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a concise statement of the exact dataset sizes (number of notes or tokens) for the ICD-9 and ICD-10 portions.
  2. [Results] Notation for micro-F1 versus macro-F1 should be introduced once and used consistently when reporting results on rare versus frequent codes.
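
The first major comment asks for the frequency threshold behind the rare-code set. As a sketch of what such a definition usually amounts to, the snippet below partitions codes by their count in the training labels. The function name and the `min_count` parameter are hypothetical, and the threshold value the paper actually uses is not stated on this page.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def split_by_frequency(
    train_labels: Iterable[Set[str]], min_count: int
) -> Tuple[Set[str], Set[str]]:
    """Partition codes seen in training into (rare, frequent) by occurrence count."""
    freq = Counter(code for labels in train_labels for code in labels)
    frequent = {c for c, n in freq.items() if n >= min_count}
    rare = set(freq) - frequent
    return rare, frequent


# Toy training labels; min_count=2 is an arbitrary illustrative threshold.
train = [{"I10", "E11.9"}, {"I10"}, {"I10", "I50.9"}]
rare, frequent = split_by_frequency(train, min_count=2)
```

Codes that appear only in the test set fall into neither group under this definition, so a report would also need to say how unseen codes are handled.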

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing clarifications from the manuscript and committing to revisions where they will strengthen the presentation of our results.

  1. Referee: [Abstract] Abstract: the headline 27% micro-F1 lift on 18K rare codes is presented without any baseline model specification, statistical significance test, or explicit frequency threshold used to define the rare-code set; these omissions leave the central empirical claim only weakly supported.

    Authors: We agree that the abstract would be strengthened by including these details for self-containment. The baseline is the label-wise attention model trained on ICD-10 data alone, as described in Section 4. Statistical significance of the improvements is assessed in the results (Section 4.3). The rare-code set is defined via a frequency threshold in the experimental setup (Section 4.2), yielding the reported 18K codes. We will revise the abstract to explicitly reference the baseline, note the significance testing, and state the frequency threshold used. revision: yes

  2. Referee: [Experiments] Experiments (assumed §4): no size-matched ablation is reported that adds an equivalent number of additional ICD-10-only examples to the training set; without this control, it is impossible to separate the benefit of multi-version training from the simple effect of larger training corpus size.

    Authors: This is a fair and important point for isolating the contribution of cross-version data. While Table 1 reports the differing training set sizes and the gains are most pronounced on rare codes (where additional volume alone would be less impactful), a direct size-matched control is absent. We will add this ablation in the revised experiments section by augmenting the ICD-10-only training set with an equivalent volume of additional ICD-10 examples to match the multi-version corpus size. revision: yes

  3. Referee: [Methods] Methods (assumed §3): the description of the label-wise attention architecture does not specify whether ICD-9 and ICD-10 code embeddings or attention parameters are shared or kept separate, which is load-bearing for the claim that the model learns transferable cross-version representations without explicit mapping.

    Authors: We appreciate the referee highlighting this for clarity. In the modified label-wise attention model (Section 3), the attention parameters are shared across versions to enable transferable representations, while code embeddings remain version-specific to accommodate differences in definitions and granularity. We will add an explicit statement in the methods section detailing this sharing strategy to make the architecture unambiguous. revision: yes
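
The size-matched control discussed in the second response can be sketched as three training arms. Everything here is an assumption about how such an ablation might be set up: the arm names, matching by number of notes (one could instead match by tokens), and the existence of a held-back pool of extra ICD-10 notes.

```python
import random
from typing import Dict, List, Sequence


def build_arms(
    icd10_train: Sequence[str],
    icd9_train: Sequence[str],
    icd10_extra_pool: Sequence[str],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Return note IDs for the baseline, multi-version, and size-matched arms."""
    need = len(icd9_train)
    if len(icd10_extra_pool) < need:
        raise ValueError("not enough held-back ICD-10 notes for a size-matched control")
    rng = random.Random(seed)
    # Draw as many extra ICD-10 notes as there are ICD-9 notes.
    extra = rng.sample(list(icd10_extra_pool), need)
    return {
        "icd10_only": list(icd10_train),
        "multi_version": list(icd10_train) + list(icd9_train),
        "size_matched_icd10": list(icd10_train) + extra,
    }
```

If the multi-version arm still beats the size-matched arm on rare codes, volume alone would not explain the gain. Where no extra ICD-10 notes exist, the alternative is to shrink the multi-version arm down to the ICD-10-only size, which tests a slightly different question.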

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on held-out data


The paper reports experimental results from training a label-wise attention model on combined ICD-9/ICD-10 data versus ICD-10 alone, then measuring micro-F1 and macro metrics on a held-out ICD-10 test set. No equations, derivations, or self-referential definitions appear in the abstract or described setup. Performance numbers are direct outputs of standard train/test splits and are externally falsifiable; they do not reduce to any fitted quantity defined by the same data or to a self-citation chain. The central claim therefore remains an independent empirical observation rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities beyond the generic assumption that the attention model can ingest mixed-version labels.


