
arxiv: 2606.28798 · v1 · pith:LMR7M6MZnew · submitted 2026-06-27 · 💻 cs.AI · stat.AP

Primary ICD Category Prediction using LLM-based Probing

Pith reviewed 2026-06-30 09:35 UTC · model grok-4.3

classification 💻 cs.AI stat.AP
keywords ICD coding · LLM probing · multimodal EHR · MIMIC-IV · frozen models · clinical adapters · diagnosis prediction · linear probes

The pith

Frozen medical LLM representations serve as a shared embedding space for multimodal primary ICD category prediction from EHR data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates whether hidden states from a frozen medical LLM can unify signals from structured EHR variables and clinical narratives to predict the primary diagnosis category. It builds a MIMIC-IV cohort from the ten most common primary ICD-10 codes (consolidated into seven categories), serializes structured fields into text, merges them with leakage-pruned discharge notes, and trains linear probes on layer-wise representations. The combined probe reaches 87.69 percent strict accuracy and 91.45 percent medical accuracy, beating single-modality probes and baselines, while a small adapter restores performance on MIMIC-III with only five percent of target labels. This matters because ICD coding drives reimbursement, research, and population health surveillance, so methods that reuse one frozen model across modalities and datasets could lower the cost of maintaining automated coding systems.

Core claim

Using a frozen MedFound-Llama3-8B-finetuned backbone, linear probes on combined serialized structured variables and discharge notes achieve 87.69 percent strict accuracy and 91.45 percent medical accuracy on MIMIC-IV primary ICD categories, surpassing single-modality probes and baselines such as XGBoost and PLM-ICD. Diagnostic information becomes increasingly linearly separable in deeper transformer layers. A 2M-parameter bottleneck adapter restores effective cross-dataset transfer to MIMIC-III using only 5 percent of target labels.
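As a rough sanity check on the 2M-parameter figure, the sketch below counts the weights of a generic down-projection/up-projection bottleneck adapter. The hidden width of 4096 is that of Llama-3-8B; the bottleneck width of 256 and the use of bias terms are assumptions, since this summary does not describe the adapter's architecture.

```python
def bottleneck_adapter_params(hidden_size: int, bottleneck: int, bias: bool = True) -> int:
    """Count parameters of a down-projection followed by an up-projection."""
    down = hidden_size * bottleneck + (bottleneck if bias else 0)
    up = bottleneck * hidden_size + (hidden_size if bias else 0)
    return down + up


HIDDEN_SIZE = 4096  # hidden width of Llama-3-8B
BOTTLENECK = 256    # assumed width; not stated in the source

n_params = bottleneck_adapter_params(HIDDEN_SIZE, BOTTLENECK)
print(f"{n_params:,} adapter parameters")
```

Under these assumptions the adapter has about 2.1 million parameters, roughly 0.03 percent of an 8B-parameter backbone, which is consistent in scale with the figure quoted above.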

What carries the argument

Linear probes trained on hidden states from five layers of a frozen medical LLM, applied to multimodal inputs created by serializing structured EHR variables into narratives and merging with leakage-pruned discharge notes.
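The probing idea can be illustrated without an LLM. The sketch below trains a logistic-regression probe on fixed feature vectors and compares held-out accuracy between two "layers". Everything here is a stand-in: the features are synthetic Gaussians rather than real hidden states, the task is binary rather than seven-way, and the layer names and separation values are invented for illustration.

```python
import math
import random


def train_probe(X, y, epochs=200, lr=0.1):
    """Fit a binary logistic-regression probe; the features themselves stay fixed."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            err = 1.0 / (1.0 + math.exp(-z)) - t
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of examples whose predicted class matches the label."""
    hits = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(t)
               for x, t in zip(X, y))
    return hits / len(X)


def synthetic_layer(separation, n=200, dim=4, seed=0):
    """Stand-in for hidden states: two classes whose means differ by `separation`."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = separation / 2 if label else -separation / 2
        X.append([rng.gauss(shift, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


layer_accuracy = {}
for layer, separation in {"shallow": 0.1, "deep": 3.0}.items():  # invented values
    X, y = synthetic_layer(separation)
    w, b = train_probe(X[:150], y[:150])          # fit on the first 150 examples
    layer_accuracy[layer] = accuracy(w, b, X[150:], y[150:])  # score on the rest
print(layer_accuracy)
```

Because the probe is linear, its held-out accuracy is a direct readout of how linearly separable the classes are in a given representation, which is the sense in which deeper layers are said to carry more accessible diagnostic information.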

If this is right

  • The combined multimodal probe outperforms both structured-only and unstructured-only probes as well as XGBoost and PLM-ICD baselines on MIMIC-IV.
  • The structured-only probe improves medical accuracy by 6.19 points over its standard baseline.
  • Hidden states from deeper LLM layers exhibit greater linear separability for diagnostic categories.
  • A 2M-parameter adapter enables cross-dataset adaptation to MIMIC-III with only 5 percent of target labels.
  • LLM embeddings support efficient reuse of clinical representations across modalities and datasets through small representation-level modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layer-wise probing setup could be applied to other EHR prediction targets such as length of stay or readmission risk without retraining the underlying model.
  • Serialization of tabular fields into text may allow similar unification in domains outside medicine that mix structured records with free text.
  • Small representation-space adapters may provide a general route for adapting clinical models to new institutions or updated coding systems with limited labeled data.
  • If the no-leakage premise holds, the method lowers the computational barrier for experimenting with multimodal clinical representations.

Load-bearing premise

Serializing structured EHR variables into clinical narratives and combining them with discharge notes does not introduce significant information leakage or bias, allowing the frozen LLM to serve as a shared embedding space.
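To make the premise concrete, here is a minimal sketch of what serializing structured fields and merging them with a note could look like. The field names, sentence templates, and example record are all hypothetical; the paper's actual serialization format is not described in this summary.

```python
def serialize_structured(record: dict) -> str:
    """Render structured EHR fields as one short narrative string."""
    parts = [f"{record['age']}-year-old {record['sex']} patient."]
    if record.get("labs"):
        labs = ", ".join(f"{name} {value} {unit}" for name, value, unit in record["labs"])
        parts.append(f"Laboratory results: {labs}.")
    if record.get("medications"):
        parts.append("Medications: " + ", ".join(record["medications"]) + ".")
    return " ".join(parts)


def build_combined_input(record: dict, pruned_note: str) -> str:
    """Concatenate the serialized fields with an already-pruned discharge note."""
    return serialize_structured(record) + "\n\n" + pruned_note.strip()


# Invented example record, for illustration only.
example = {
    "age": 67,
    "sex": "female",
    "labs": [("creatinine", 1.8, "mg/dL"), ("troponin T", 0.04, "ng/mL")],
    "medications": ["furosemide", "metoprolol"],
}
text = build_combined_input(example, "Patient presented with dyspnea on exertion. ")
print(text)
```

Once both modalities are plain text, a single frozen model can embed them with the same tokenizer and layers, which is what makes the shared-embedding framing possible and also why any overlap between the two text sources becomes a leakage concern.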

What would settle it

A controlled test in which the combined probe shows no accuracy advantage over the unstructured-only probe on held-out MIMIC-IV admissions, or in which the 2M-parameter adapter produces no gain in MIMIC-III transfer accuracy beyond a randomly initialized linear layer.
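One way to run the first comparison is a paired test on per-admission correctness, since both probes are scored on the same held-out cases. The sketch below uses an exact McNemar test. This is a standard choice assumed here for illustration, not a procedure the source attributes to the authors, and the correctness flags are toy values.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-example correctness flags."""
    only_a = sum(a and not b for a, b in zip(correct_a, correct_b))
    only_b = sum(b and not a for a, b in zip(correct_a, correct_b))
    n = only_a + only_b  # discordant pairs carry all the evidence
    if n == 0:
        return only_a, only_b, 1.0
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Toy flags for 12 held-out admissions (invented, not from the paper).
combined_ok = [True] * 10 + [False] * 2
text_only_ok = [True] * 4 + [False] * 6 + [True, False]

only_combined, only_text, p_value = mcnemar_exact(combined_ok, text_only_ok)
print(only_combined, only_text, p_value)
```

A large p-value on the real held-out set would mean the combined probe's advantage cannot be distinguished from chance disagreement between the two probes.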

Original abstract

Objective: ICD codes are central to reimbursement, research, and population health surveillance, yet automated coding systems often struggle to integrate diagnostic signals from both clinical narratives and structured electronic health record (EHR) variables. We evaluated whether frozen medical large language model (LLM) representations can serve as a shared embedding space for multimodal primary diagnosis category prediction.

Materials and Methods: We constructed a MIMIC-IV cohort of 13,645 admissions from the 10 most frequent primary ICD-10 codes, consolidated into seven categories. Structured variables were serialized into clinical narratives and combined with leakage-pruned discharge notes. Using a frozen MedFound-Llama3-8B-finetuned backbone, we extracted hidden states from five transformer layers and trained linear probes for structured-only, unstructured-only, and combined inputs, comparing against XGBoost and information-matched PLM-ICD baselines and evaluating MIMIC-III adaptation with a compact bottleneck adapter.

Results: The combined probe performed best on MIMIC-IV (87.69% strict; 91.45% medical accuracy), exceeding both single-modality probes and baselines. The structured-only probe outperformed its standard baseline by 6.19 points in medical accuracy. Diagnostic information became increasingly linearly separable in deeper layers, and a 2M-parameter adapter restored cross-dataset transfer to MIMIC-III using only 5% of target labels.

Discussion: LLM embeddings can unify structured and narrative EHR information for multimodal diagnosis prediction, supporting efficient reuse of clinical representations across modalities and datasets through a small representation-level module.

Conclusion: Multimodal probing of frozen medical LLM representations provides a practical approach for studying EHR modalities and adapting clinical representations across datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that frozen representations from a medical LLM (MedFound-Llama3-8B) can act as a shared embedding space for multimodal primary ICD category prediction. Structured EHR variables are serialized into narratives and combined with leakage-pruned discharge notes from a MIMIC-IV cohort of 13,645 admissions covering seven primary diagnosis categories. Linear probes on combined inputs achieve 87.69% strict accuracy and 91.45% medical accuracy, outperforming single-modality probes and baselines like XGBoost and PLM-ICD. Diagnostic separability increases in deeper layers, and a 2M-parameter adapter enables transfer to MIMIC-III with only 5% target labels.

Significance. If the no-leakage assumption holds, the work offers a parameter-efficient method for integrating structured and unstructured EHR data using pre-trained LLMs, with strong empirical results on a real clinical cohort and successful cross-dataset adaptation. This could support more accurate automated coding systems and facilitate reuse of clinical representations across modalities and institutions.

major comments (1)
  1. [Materials and Methods] The data preparation subsection on serializing structured variables (demographics, labs, meds) into clinical narratives and concatenating with 'leakage-pruned' discharge notes must provide concrete evidence that the pruning removes all structured-derived ICD-relevant content. The abstract invokes this to justify the shared embedding premise, but without details on the pruning method (e.g., semantic matching vs. surface-level) and validation against residual correlations, the reported gains for the combined probe (87.69% strict accuracy) and the 6.19-point lift over XGBoost cannot be unambiguously attributed to multimodal integration rather than leakage artifacts.
minor comments (2)
  1. [Results] The abstract reports accuracy numbers but lacks details on cohort construction criteria, exact layer selection for probing, and statistical significance of improvements over baselines.
  2. [Abstract] Clarify the definition of 'medical accuracy' versus 'strict' accuracy, as this distinction is central to interpreting the 91.45% figure.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for explicit validation of the leakage-pruning procedure. This is a substantive methodological point that strengthens the interpretability of our multimodal results. We address it directly below and will revise the manuscript to incorporate additional detail and evidence.

Point-by-point responses
  1. Referee: [Materials and Methods] The data preparation subsection on serializing structured variables (demographics, labs, meds) into clinical narratives and concatenating with 'leakage-pruned' discharge notes must provide concrete evidence that the pruning removes all structured-derived ICD-relevant content. The abstract invokes this to justify the shared embedding premise, but without details on the pruning method (e.g., semantic matching vs. surface-level) and validation against residual correlations, the reported gains for the combined probe (87.69% strict accuracy) and the 6.19-point lift over XGBoost cannot be unambiguously attributed to multimodal integration rather than leakage artifacts.

    Authors: We agree that the current description is insufficient to rule out residual leakage as a confounder. The pruning was performed via surface-level string matching and removal of any mention of the structured variables (demographics, lab results, medications) that appear in the discharge notes, followed by manual review of a 200-note sample. In the revision we will (1) expand the Materials and Methods section with the exact matching rules and code-level description, (2) report quantitative checks (Pearson correlation between pruned-note embeddings and structured-only embeddings for ICD-predictive tokens; zero-shot classifier accuracy on pruned notes for the seven categories), and (3) add an appendix table showing that these correlations remain near zero after pruning. These additions will allow readers to assess whether the 6.19-point medical-accuracy lift and the combined-probe superiority are attributable to true multimodal fusion. We will also clarify in the abstract that the pruning validation is provided in the supplement.

    revision: yes
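The surface-level pruning described in the response might look roughly like the sketch below. It is an assumption-laden illustration: dropping whole sentences (rather than only the matched phrase), the sentence splitter, the term list, and the example note are all hypothetical, and the authors' exact matching rules are not given here.

```python
import re


def prune_note(note: str, structured_terms: list) -> str:
    """Drop every sentence that mentions a structured term (case-insensitive, whole word)."""
    if not structured_terms:
        return note.strip()
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in structured_terms) + r")\b",
        re.IGNORECASE,
    )
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", note.strip())
    return " ".join(s for s in sentences if not pattern.search(s))


def residual_mentions(note: str, structured_terms: list) -> list:
    """List structured terms that still appear in the text after pruning."""
    lowered = note.lower()
    return [t for t in structured_terms if t.lower() in lowered]


# Invented note and term list, for illustration only.
terms = ["creatinine", "furosemide"]
note = ("Admitted with chest pain. Creatinine was 1.8 on arrival. "
        "Started on furosemide 40 mg. Discharged home in stable condition.")
pruned = prune_note(note, terms)
print(pruned)
print(residual_mentions(pruned, terms))
```

A string-level filter of this kind cannot catch paraphrases or abbreviations of the same finding, which is why the referee's request for a semantic or correlation-based validation goes beyond what surface matching alone can show.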

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained against external baselines.

Full rationale

The paper reports an empirical machine-learning study: structured EHR variables are serialized into text, concatenated with leakage-pruned notes, passed through a frozen LLM, and linear probes are trained on the resulting hidden states. Performance is measured by strict accuracy and medical accuracy on held-out MIMIC-IV data and compared to XGBoost and PLM-ICD baselines; cross-dataset transfer uses a small adapter trained on 5% of MIMIC-III labels. No equations, uniqueness theorems, or derivations are presented that reduce any reported result to its own inputs by construction. No self-citations are invoked as load-bearing premises. The central claims rest on standard train/test splits and external baselines, rendering the evaluation falsifiable and independent of the reported numbers themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The abstract provides limited technical detail on assumptions; the main reliance is on the effectiveness of linear probing on pre-trained representations, a common assumption in representation learning.

axioms (1)
  • domain assumption: The hidden states from the frozen MedFound-Llama3-8B model capture linearly separable diagnostic information from EHR data.
    Central to the probing method described in results.

pith-pipeline@v0.9.1-grok · 5833 in / 1450 out tokens · 58988 ms · 2026-06-30T09:35:06.118778+00:00 · methodology

