Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes
Pith reviewed 2026-05-20 11:51 UTC · model grok-4.3
The pith
Combining ICD-9 and ICD-10 training data raises micro F1 on rare ICD-10 codes by 27 percent without any code mapping.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A modified label-wise attention model trained on mixed ICD-9 and ICD-10 data outperforms an ICD-10-only model on ICD-10 prediction. For roughly 18,000 rare ICD-10 codes the micro F1 score rises 27 percent; for 8,000 frequent codes macro metrics also improve. These gains occur without explicit alignment between the two code sets and with a smaller total parameter count.
What carries the argument
Label-wise attention model trained on pooled ICD-9 and ICD-10 annotations that learns shared representations across versions without mapping steps.
If this is right
- Historical ICD-9 data can be reused to improve current ICD-10 models without version-specific retraining.
- The long-tail problem in clinical coding becomes less severe when older annotations supplement newer ones.
- Fewer parameters suffice for strong coverage of both rare and common codes when multi-version data is used.
- Models may generalize across future ICD releases if the same joint-training pattern continues to work.
Where Pith is reading between the lines
- Semantic overlap between versions appears large enough that explicit cross-version mapping may often be unnecessary.
- The same joint-training idea could be tested on other evolving medical terminologies such as SNOMED CT or procedure codes.
- Regions or hospitals that still hold large ICD-9 archives could immediately improve their ICD-10 systems without new labeling campaigns.
Load-bearing premise
The attention model can extract useful shared signals from ICD-9 and ICD-10 labels even though the two versions differ in definition, granularity, and annotation habits.
What would settle it
If a model trained only on ICD-10 data matches or exceeds the combined model's micro F1 on the 18,000 rare ICD-10 codes, the claimed benefit of multi-version training would not hold.
Figures
read the original abstract
Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines training a modified label-wise attention model for ICD-10 code prediction by augmenting ICD-10 training data with ICD-9 annotations. It claims that, despite version differences in code definitions and granularity, this multi-version approach yields a 27% micro-F1 improvement on 18K rare ICD-10 codes relative to ICD-10-only training, plus macro-metric gains on 8K frequent codes, all while using substantially fewer parameters.
Significance. If the reported gains are attributable to cross-version representation sharing in the label-wise attention layers rather than simply to increased training volume, the result would be practically relevant for clinical coding systems that must accommodate ICD version transitions and long-tail code distributions. The work directly targets a known bottleneck in deployable medical NLP models.
major comments (3)
- [Abstract] Abstract: the headline 27% micro-F1 lift on 18K rare codes is presented without any baseline model specification, statistical significance test, or explicit frequency threshold used to define the rare-code set; these omissions leave the central empirical claim only weakly supported.
- [Experiments] Experiments (assumed §4): no size-matched ablation is reported that adds an equivalent number of additional ICD-10-only examples to the training set; without this control, it is impossible to separate the benefit of multi-version training from the simple effect of larger training corpus size.
- [Methods] Methods (assumed §3): the description of the label-wise attention architecture does not specify whether ICD-9 and ICD-10 code embeddings or attention parameters are shared or kept separate, which is load-bearing for the claim that the model learns transferable cross-version representations without explicit mapping.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a concise statement of the exact dataset sizes (number of notes or tokens) for the ICD-9 and ICD-10 portions.
- [Results] Notation for micro-F1 versus macro-F1 should be introduced once and used consistently when reporting results on rare versus frequent codes.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, providing clarifications from the manuscript and committing to revisions where they will strengthen the presentation of our results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline 27% micro-F1 lift on 18K rare codes is presented without any baseline model specification, statistical significance test, or explicit frequency threshold used to define the rare-code set; these omissions leave the central empirical claim only weakly supported.
Authors: We agree that the abstract would be strengthened by including these details for self-containment. The baseline is the label-wise attention model trained on ICD-10 data alone, as described in Section 4. Statistical significance of the improvements is assessed in the results (Section 4.3). The rare-code set is defined via a frequency threshold in the experimental setup (Section 4.2), yielding the reported 18K codes. We will revise the abstract to explicitly reference the baseline, note the significance testing, and state the frequency threshold used. revision: yes
-
Referee: [Experiments] Experiments (assumed §4): no size-matched ablation is reported that adds an equivalent number of additional ICD-10-only examples to the training set; without this control, it is impossible to separate the benefit of multi-version training from the simple effect of larger training corpus size.
Authors: This is a fair and important point for isolating the contribution of cross-version data. While Table 1 reports the differing training set sizes and the gains are most pronounced on rare codes (where additional volume alone would be less impactful), a direct size-matched control is absent. We will add this ablation in the revised experiments section by augmenting the ICD-10-only training set with an equivalent volume of additional ICD-10 examples to match the multi-version corpus size. revision: yes
-
Referee: [Methods] Methods (assumed §3): the description of the label-wise attention architecture does not specify whether ICD-9 and ICD-10 code embeddings or attention parameters are shared or kept separate, which is load-bearing for the claim that the model learns transferable cross-version representations without explicit mapping.
Authors: We appreciate the referee highlighting this for clarity. In the modified label-wise attention model (Section 3), the attention parameters are shared across versions to enable transferable representations, while code embeddings remain version-specific to accommodate differences in definitions and granularity. We will add an explicit statement in the methods section detailing this sharing strategy to make the architecture unambiguous. revision: yes
Circularity Check
No circularity: purely empirical comparison on held-out data
full rationale
The paper reports experimental results from training a label-wise attention model on combined ICD-9/ICD-10 data versus ICD-10 alone, then measuring micro-F1 and macro metrics on a held-out ICD-10 test set. No equations, derivations, or self-referential definitions appear in the abstract or described setup. Performance numbers are direct outputs of standard train/test splits and are externally falsifiable; they do not reduce to any fitted quantity defined by the same data or to a self-citation chain. The central claim therefore remains an independent empirical observation rather than a tautology.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Code like humans: A multi-agent solution for medical coding
Motzfeldt, Andreas Geert and Edin, Joakim and Christensen, Casper L and Hardmeier, Christian and Maaløe, Lars and Rogers, Anna. Code like humans: A multi-agent solution for medical coding. Findings of the Association for Computational Linguistics: EMNLP 2025
work page 2025
-
[2]
ICD -11: an international classification of diseases for the twenty-first century
Harrison, James E and Weber, Stefanie and Jakob, Robert and Chute, Christopher G. ICD -11: an international classification of diseases for the twenty-first century. BMC Medical Informatics and Decision Making
-
[3]
MedDCR : Learning to design agentic workflows for medical coding
Zheng, Jiyang and Nassar, Islam and Vu, Thanh and Zhong, Xu and Lin, Yang and Liu, Tongliang and Duong, Long and Li, Yuan-Fang. MedDCR : Learning to design agentic workflows for medical coding. arXiv [cs.AI]. arXiv:2511.13361
-
[4]
Improving rare and common ICD coding via a multi-agent LLM -based approach
Li, Rumeng and Wang, Xun and Yu, Hong. Improving rare and common ICD coding via a multi-agent LLM -based approach. Proceedings of the 34th ACM International Conference on Information and Knowledge Management
-
[5]
An unsupervised approach to achieve supervised-level explainability in healthcare records
Edin, Joakim and Maistro, Maria and Maaløe, Lars and Borgholt, Lasse and Havtorn, Jakob Drachmann and Ruotsalo, Tuukka. An unsupervised approach to achieve supervised-level explainability in healthcare records. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
work page 2024
-
[6]
He, Kai and Mao, Rui and Lin, Qika and Ruan, Yucheng and Lan, Xiang and Feng, Mengling and Cambria, Erik. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics. An International Journal on Information Fusion
-
[7]
Surpassing GPT- 4 medical coding with a two-stage approach
Yang, Zhichao and Batra, Sanjit Singh and Stremmel, Joel and Halperin, Eran. Surpassing GPT -4 medical coding with a two-stage approach. arXiv [cs.CL]. arXiv:2311.13735
-
[8]
On the cross-lingual transferability of monolingual representations
Artetxe, Mikel and Ruder, Sebastian and Yogatama, Dani. On the cross-lingual transferability of monolingual representations. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
-
[9]
Feasibility of replacing the ICD -10- CM with the ICD -11 for morbidity coding: A content analysis
Fung, Kin Wah and Xu, Julia and McConnell-Lamptey, Shannon and Pickett, Donna and Bodenreider, Olivier. Feasibility of replacing the ICD -10- CM with the ICD -11 for morbidity coding: A content analysis. Journal of the American Medical Informatics Association
-
[10]
CoRelation : Boosting Automatic ICD Coding through Contextualized Code Relation Learning
Luo, Junyu and Wang, Xiaochen and Wang, Jiaqi and Chang, Aofei and Wang, Yaqing and Ma, Fenglong. CoRelation : Boosting Automatic ICD Coding through Contextualized Code Relation Learning. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
work page 2024
-
[11]
A Label Attention Model for ICD Coding from Clinical Text
Vu, Thanh and Nguyen, Dat Quoc and Nguyen, Anthony. A Label Attention Model for ICD Coding from Clinical Text. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20
-
[12]
Explainable Prediction of Medical Codes from Clinical Text
Mullenbach, James and Wiegreffe, Sarah and Duke, Jon and Sun, Jimeng and Eisenstein, Jacob. Explainable Prediction of Medical Codes from Clinical Text. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)
work page 2018
-
[13]
The Clinician and Dataset Shift in Artificial Intelligence
Finlayson, Samuel G and Subbaswamy, Adarsh and Singh, Karandeep and Bowers, John and Kupke, Annabel and Zittrain, Jonathan and Kohane, Isaac S and Saria, Suchi. The Clinician and Dataset Shift in Artificial Intelligence. The New England journal of medicine
-
[14]
Automated Medical Coding on MIMIC - III and MIMIC - IV : A Critical Review and Replicability Study
Edin, Joakim and Junge, Alexander and Havtorn, Jakob D and Borgholt, Lasse and Maistro, Maria and Ruotsalo, Tuukka and Maaløe, Lars. Automated Medical Coding on MIMIC - III and MIMIC - IV : A Critical Review and Replicability Study. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
-
[15]
Simmons, Ashley and Takkavatakarn, Kullaya and McDougal, Megan and Dilcher, Brian and Pincavitch, Jami and Meadows, Lukas and Kauffman, Justin and Klang, Eyal and Wig, Rebecca and Smith, Gordon and Soroush, Ali and Freeman, Robert and Apakama, Donald J and Charney, Alexander W and Kohli-Seth, Roopa and Nadkarni, Girish N and Sakhuja, Ankit. Extracting int...
-
[16]
Deep learning for automatic ICD coding: Review, opportunities and challenges
Li, Xiaobo and Zhang, Yijia and Hou, Xiaodi and Wang, Shilong and Lin, Hongfei. Deep learning for automatic ICD coding: Review, opportunities and challenges. Artificial intelligence in medicine
-
[17]
Automated clinical coding: what, why, and where we are?
Dong, Hang and Falis, Matúš and Whiteley, William and Alex, Beatrice and Matterson, Joshua and Ji, Shaoxiong and Chen, Jiaoyan and Wu, Honghan. Automated clinical coding: what, why, and where we are?. NPJ digital medicine
-
[18]
A survey on clinical natural language processing in the United Kingdom from 2007 to 2022
Wu, Honghan and Wang, Minhong and Wu, Jinge and Francis, Farah and Chang, Yun-Hsuan and Shavick, Alex and Dong, Hang and Poon, Michael T C and Fitzpatrick, Natalie and Levine, Adam P and Slater, Luke T and Handy, Alex and Karwath, Andreas and Gkoutos, Georgios V and Chelala, Claude and Shah, Anoop Dinesh and Stewart, Robert and Collier, Nigel and Alex, Be...
work page 2007
-
[19]
Wu, John and Wu, David and Sun, Jimeng. Beyond label attention: Transparency in language models for automated medical coding via dictionary learning. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
work page 2024
-
[20]
Combining classifiers in text categorization
Larkey, Leah S and Croft, W Bruce. Combining classifiers in text categorization. Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
-
[21]
Code Synonyms Do Matter: Multiple Synonyms Matching Network for Automatic ICD Coding
Yuan, Zheng and Tan, Chuanqi and Huang, Songfang. Code Synonyms Do Matter: Multiple Synonyms Matching Network for Automatic ICD Coding. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
-
[22]
A systematic literature review of automated clinical coding and classification systems
Stanfill, Mary H and Williams, Margaret and Fenton, Susan H and Jenders, Robert A and Hersh, William R. A systematic literature review of automated clinical coding and classification systems. Journal of the American Medical Informatics Association: JAMIA
-
[23]
Towards Automated ICD Coding Using Deep Learning
Shi, Haoran and Xie, Pengtao and Hu, Zhiting and Zhang, Ming and Xing, Eric P. Towards Automated ICD Coding Using Deep Learning. arXiv [cs.CL]. arXiv:1711.04075
-
[24]
MIMIC- IV , a freely accessible electronic health record dataset
Johnson, Alistair E W and Bulgarelli, Lucas and Shen, Lu and Gayles, Alvin and Shammout, Ayad and Horng, Steven and Pollard, Tom J and Moody, Benjamin and Gow, Brian and Lehman, Li-Wei H and Celi, Leo A and Mark, Roger G. MIMIC- IV , a freely accessible electronic health record dataset. Scientific data
-
[25]
PLM - ICD : Automatic ICD Coding with Pretrained Language Models
Huang, Chao-Wei and Tsai, Shang-Chi and Chen, Yun-Nung. PLM - ICD : Automatic ICD Coding with Pretrained Language Models. Proceedings of the 4th Clinical Natural Language Processing Workshop
-
[26]
and Zimlichman Eyal and Barash Yiftach and Freeman Robert and Charney Alexander W
Soroush Ali and Glicksberg Benjamin S. and Zimlichman Eyal and Barash Yiftach and Freeman Robert and Charney Alexander W. and Nadkarni Girish N and Klang Eyal. Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying. NEJM AI
-
[27]
Vaswani, Ashish and Shazeer, Noam M and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, Lukasz and Polosukhin, Illia. Attention is All you Need. Neural Information Processing Systems. 1706.03762
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
A unified review of deep learning for automated medical coding
Ji, Shaoxiong and Li, Xiaobo and Sun, Wei and Dong, Hang and Taalas, Ara and Zhang, Yijia and Wu, Honghan and Pitkänen, Esa and Marttinen, Pekka. A unified review of deep learning for automated medical coding. ACM computing surveys
-
[29]
Gan, Yidong and Rybinski, Maciej and Hachey, Ben and Kummerfeld, Jonathan K. Aligning AI research with the needs of clinical coding workflows: Eight recommendations based on US data analysis and critical review. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
-
[30]
Less is more: Explainable and efficient ICD code prediction with clinical entities
Douglas, James C and Gan, Yidong and Hachey, Ben and Kummerfeld, Jonathan K. Less is more: Explainable and efficient ICD code prediction with clinical entities. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
-
[31]
Automated ICD coding using extreme multi-label long text transformer-based models
Liu, Leibo and Perez-Concha, Oscar and Nguyen, Anthony and Bennett, Vicki and Jorm, Louisa. Automated ICD coding using extreme multi-label long text transformer-based models. Artificial intelligence in medicine
-
[32]
MDACE : MIMIC documents annotated with code evidence
Cheng, Hua and Jafari, Rana and Russell, April and Klopfer, Russell and Lu, Edmond and Striner, Benjamin and Gormley, Matthew. MDACE : MIMIC documents annotated with code evidence. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.