arxiv: 2604.22348 · v1 · submitted 2026-04-24 · 💻 cs.LG

A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency

Pith reviewed 2026-05-08 12:25 UTC · model grok-4.3

classification 💻 cs.LG
keywords medical claims data · foundation models · model scaling · transformer · disease prediction · medication prediction · computational efficiency

The pith

Medical claims foundation models reach peak downstream performance at task-dependent sizes, so the largest model is not always the best choice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the scaling laws observed in language models apply to structured medical claims data, which has a small vocabulary and sparse records. The researchers pretrained encoder-only Transformers at five sizes, from 2.2 million to 101 million parameters, on a sample of 2.3 million patients from a large Japanese claims database, then tested them on disease incidence and medication prediction. Performance saturated at different thresholds: disease prediction kept improving with larger models up to 101 million parameters, while medication prediction stopped improving beyond 11 million parameters. For every task, the best-performing model outperformed a Light Gradient Boosting Machine (LightGBM) baseline on area under the precision-recall curve (AUPRC). The earlier saturation for medication prediction means compute can be saved on that task without losing accuracy.

Core claim

Encoder-only Transformer models pretrained on nationwide Japanese medical claims data from 2.3 million patients show task-dependent scaling behavior. Downstream performance on disease prediction continues to benefit from scaling up to 101 million parameters, whereas medication prediction saturates at 11 million parameters, which allows a reduction in pretraining time of 178 hours. For each task, the best-performing model still exceeds a Light Gradient Boosting Machine baseline in area under the precision-recall curve.
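For readers less familiar with the metric, the following is a minimal sketch of AUPRC computed as average precision, one common step-wise estimator. Whether the paper uses this estimator or an interpolated variant is not stated on this page; the function name and the toy labels are illustrative assumptions.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    Assumes binary labels in {0, 1}. Tied scores are broken by input order,
    which is a simplification compared with library implementations.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive label")
    # Rank examples from highest to lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            # Precision at the rank where recall increases.
            total += true_positives / rank
    return total / n_pos


# Toy example (invented labels and scores, not data from the paper).
print(round(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]), 4))  # 0.8333
```

Because the metric averages precision only at the ranks of positive examples, it is sensitive to class imbalance, which is why it is often preferred over AUROC for rare outcomes.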

What carries the argument

Five-scale pretraining of encoder-only Transformers (2.2M to 101M parameters) on longitudinal structured medical claims, followed by task-specific fine-tuning and evaluation on disease incidence and medication prediction.
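To give a feel for how parameter counts move with depth and width, here is a rough sketch that counts the weights of a standard encoder-only Transformer. The layer counts, hidden sizes, vocabulary size, and sequence length in the usage lines are hypothetical placeholders; the paper's actual configurations are not given on this page, and the count ignores any task heads.

```python
from typing import Optional


def encoder_param_count(
    n_layers: int,
    d_model: int,
    vocab_size: int,
    max_len: int,
    d_ff: Optional[int] = None,
) -> int:
    """Approximate parameter count of a vanilla encoder-only Transformer."""
    if d_ff is None:
        d_ff = 4 * d_model  # common convention, assumed here
    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (d_model * d_model + d_model)
    # Two-layer feed-forward block with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * 2 * d_model
    per_layer = attention + feed_forward + layer_norms
    # Token embeddings plus learned positional embeddings.
    embeddings = vocab_size * d_model + max_len * d_model
    return n_layers * per_layer + embeddings


# Hypothetical configurations, chosen only to illustrate the growth rate.
for layers, width in [(4, 128), (6, 256), (12, 512)]:
    n = encoder_param_count(layers, width, vocab_size=20_000, max_len=512)
    print(f"layers={layers} d_model={width} params={n / 1e6:.1f}M")
```

With a small vocabulary, the embedding table contributes little, so almost all growth comes from the per-layer term, which scales roughly with the number of layers times the square of the hidden size.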

Load-bearing premise

The observed differences in saturation thresholds are caused by the inherent characteristics of the two prediction tasks on structured claims data rather than by sampling choices, sparsity patterns, or hyperparameter decisions specific to this Japanese database.

What would settle it

Re-running the five-scale pretraining and evaluation experiment on an independent non-Japanese claims dataset and finding either no saturation or identical saturation thresholds across both tasks would falsify the claim of task-dependent optimal sizes.
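Comparing thresholds across datasets requires an operational definition of "saturation". One possible definition, sketched below, is the smallest model whose score lies within a tolerance of the best score. The tolerance value and the scores in the usage line are invented for illustration; the paper's own criterion is not stated on this page.

```python
from typing import Dict


def saturation_size(scores: Dict[int, float], tol: float = 0.005) -> int:
    """Return the smallest model size whose score is within `tol` of the best.

    `scores` maps model size (for example, millions of parameters) to a
    downstream metric such as AUPRC.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores.values())
    return min(size for size, score in scores.items() if score >= best - tol)


# Invented scores, not results from the paper.
toy = {2: 0.30, 5: 0.34, 11: 0.36, 32: 0.361, 101: 0.362}
print(saturation_size(toy))  # 11
```

The tolerance matters: if it is set below the run-to-run noise of fine-tuning, the reported threshold can shift between seeds, so it is best chosen from measured seed variance.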

Figures

Figures reproduced from arXiv: 2604.22348 by Akiko Hatakama, Eiichiro Uchino, Masaki Nakamura, Nanae Aratake, Nobutomo Matsui, Taisei Tosaki, Yasushi Okuno, Yuji Okamoto.

Figure 1: Overview of pretraining and fine-tuning. Masked language modeling (MLM) pretraining was performed on token…
Figure 2: Input-sequence construction and training objectives. Each clinical event token pairs a code (ICD-10 or YJ) with age in…
Figure 3: Pretraining loss versus computational cost. The test…
Figure 4: Pretrained vs. from-scratch AUPRC at the 11M scale…
Figure 5: AUPRC of five pretrained Transformers and LGBM across four tasks. Disease prediction benefits from the 32M/101M…
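The captions name masked language modeling (MLM) as the pretraining objective. The sketch below shows BERT-style masking on a sequence of integer token IDs. The 15% selection rate and the 80/10/10 replacement split are the BERT defaults and are assumed here, not confirmed for this paper; `MASK_ID`, `IGNORE`, and the special-token layout are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

MASK_ID = 1    # hypothetical ID of the [MASK] token
IGNORE = -100  # label value for positions excluded from the loss


def mask_tokens(
    token_ids: Sequence[int],
    vocab_size: int,
    rng: random.Random,
    mask_prob: float = 0.15,
    n_special: int = 2,
) -> Tuple[List[int], List[int]]:
    """Return (model inputs, labels) for one MLM training example.

    IDs below `n_special` are treated as special tokens and never masked.
    """
    inputs = list(token_ids)
    labels = [IGNORE] * len(inputs)
    for i, tok in enumerate(token_ids):
        if tok < n_special or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID  # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(n_special, vocab_size)  # 10%: random token
        # Remaining 10%: keep the original token unchanged.
    return inputs, labels


rng = random.Random(0)
inputs, labels = mask_tokens([0, 57, 102, 9, 340, 12], vocab_size=1000, rng=rng)
print(inputs)
print(labels)
```

In the paper each event token pairs a code with age, so the real input construction is richer than a flat ID sequence; this sketch covers only the masking step.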
Original abstract

Clinical risk prediction using longitudinal medical data supports individualized care. Self-supervised foundation models have emerged as a promising approach for leveraging large-scale unlabeled healthcare records. In natural language processing, scaling laws suggest that larger models achieve predictably lower pretraining losses, supporting the foundation model paradigm. However, for structured medical data, characterized by a limited vocabulary and sparse observations, whether increasing model size consistently improves downstream predictions is unclear, as most studies evaluate only a single model scale. In this study, we evaluated the relationship between model scale and downstream task performance for structured medical foundation models. Using a random sample (2.3 million patients, 32 hospitals) from a nationwide 519-hospital Japanese claims database, we pretrained encoder-only Transformers at five scales (2.2M-101M parameters) for disease incidence and medication prediction. Downstream performance saturated at task-dependent thresholds: disease prediction benefited from larger models (32M-101M), whereas medication prediction saturated at 11M, reducing pretraining time by 178 h. Across all tasks, the best-performing model consistently outperformed a Light Gradient Boosting Machine baseline in the area under the precision-recall curve. These findings indicate that, unlike the monotonically decreasing pretraining loss, the optimal model size varied depending on task characteristics. This task-dependent saturation provides practical guidance for balancing predictive performance and computational cost in structured medical foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates scaling of encoder-only Transformer foundation models on structured Japanese medical claims data (2.3M patients sampled from a 519-hospital database). Five model scales (2.2M–101M parameters) are pretrained with self-supervision and evaluated on disease incidence and medication prediction tasks. The central claim is that downstream AUPRC saturates at task-dependent thresholds (medication prediction at 11M parameters; disease prediction benefits up to 32M–101M), enabling a 178-hour pretraining reduction, while the best models outperform a LightGBM baseline across tasks. The authors conclude that, unlike pretraining loss, downstream performance on structured medical data does not improve monotonically with model size.

Significance. If the saturation points are robust, the result supplies concrete, actionable guidance for balancing predictive performance against compute in healthcare foundation models, where data sparsity and limited vocabulary differ from text. The use of real nationwide claims data, multi-scale empirical comparison, and direct LightGBM baseline evaluation are strengths that make the work practically relevant even if the precise thresholds are dataset-specific.

major comments (2)
  1. [Methods (pretraining details)] The Methods section provides no information on whether hyperparameters (learning rate, batch size, warmup, dropout) were held fixed or scaled with model size. Because the central saturation claim for medication prediction at 11M rests on the assumption that larger models were not under-trained, this omission is load-bearing; fixed schedules could artifactually produce the observed plateau even if the data distribution supports further gains.
  2. [Results (downstream evaluation)] The reported downstream AUPRC values come without data-split details (e.g., patient-level or temporal partitioning of the 2.3M cohort), statistical testing, or error bars. Without these, the task-dependent saturation thresholds and the claim that larger models improve disease prediction cannot be distinguished from sampling variability or selection bias in the random 32-hospital subsample.
minor comments (2)
  1. [Abstract] The abstract states a 178-hour pretraining reduction but does not specify hardware, wall-clock vs. total FLOPs, or whether the time includes only the 11M model or the full scaling curve.
  2. [Introduction / Methods] Notation for model scales (2.2M–101M) and task definitions (disease incidence vs. medication prediction) should be defined explicitly in a table or early section to aid reproducibility.
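Major comment 2 asks about patient-level partitioning. A minimal sketch of one way to do this is shown below: each patient ID is hashed to a stable split, so all records of a patient land in the same partition. The salt, the split fractions, and the ID format are assumptions for illustration; the paper's actual procedure is not described on this page.

```python
import hashlib


def assign_split(
    patient_id: str,
    val_frac: float = 0.1,
    test_frac: float = 0.1,
    salt: str = "split-v1",
) -> str:
    """Deterministically map a patient ID to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the hash as a uniform number in [0, 1).
    u = int(digest[:8], 16) / 0x100000000
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


# Hypothetical patient IDs.
for pid in ["P000001", "P000002", "P000003"]:
    print(pid, assign_split(pid))
```

Hash-based assignment keeps the split stable when new patients are added and avoids leakage from the same patient appearing in both training and test data. A temporal split would need an additional cutoff on event dates.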

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which have helped us identify areas where the manuscript can be strengthened for clarity and rigor. We provide point-by-point responses to the major comments below, along with our plans for revision.

Point-by-point responses
  1. Referee: [Methods (pretraining details)] The Methods section provides no information on whether hyperparameters (learning rate, batch size, warmup, dropout) were held fixed or scaled with model size. Because the central saturation claim for medication prediction at 11M rests on the assumption that larger models were not under-trained, this omission is load-bearing; fixed schedules could artifactually produce the observed plateau even if the data distribution supports further gains.

    Authors: We appreciate the referee's emphasis on this key aspect of the experimental design. The hyperparameters were held fixed across model scales, with only the architectural dimensions (layers and hidden size) adjusted to achieve the reported parameter counts. This fixed-hyperparameter approach is intentional in scaling studies to isolate the effect of model capacity. Pretraining was monitored via validation loss to confirm convergence for all scales. We will revise the Methods section to include an explicit table of all hyperparameter values and a statement confirming the fixed schedule, thereby directly addressing the concern that the observed saturation for medication prediction could be an artifact of under-training. revision: yes

  2. Referee: [Results (downstream evaluation)] The reported downstream AUPRC values come without data-split details (e.g., patient-level or temporal partitioning of the 2.3M cohort), statistical testing, or error bars. Without these, the task-dependent saturation thresholds and the claim that larger models improve disease prediction cannot be distinguished from sampling variability or selection bias in the random 32-hospital subsample.

    Authors: We agree that these details are necessary to support the robustness of the saturation claims. We will revise the Methods and Results sections to specify the patient-level partitioning approach used for the 2.3M cohort and to clarify the random selection process for the 32-hospital subsample from the full database. We will also add uncertainty quantification (such as bootstrap confidence intervals) for the AUPRC values and include statistical comparisons between scales where feasible. These additions will allow readers to better assess whether the task-dependent patterns exceed sampling variability. revision: yes
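The rebuttal mentions bootstrap confidence intervals for AUPRC. A minimal percentile-bootstrap sketch follows, assuming one evaluation row per patient so that resampling rows is resampling patients. The function names, the number of resamples, and the toy data are illustrative assumptions, and average precision stands in for whichever AUPRC estimator the authors use.

```python
import random
from typing import Callable, List, Sequence, Tuple


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Step-wise AUPRC; ties are broken by input order (a simplification)."""
    n_pos = sum(y_true)
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def bootstrap_ci(
    y_true: Sequence[int],
    y_score: Sequence[float],
    metric: Callable[[Sequence[int], Sequence[float]], float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for `metric`, resampling rows with replacement."""
    if sum(y_true) == 0:
        raise ValueError("need at least one positive label")
    rng = random.Random(seed)
    n = len(y_true)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if sum(yt) == 0:
            continue  # skip resamples with no positives; the metric is undefined
        stats.append(metric(yt, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data (invented): 20 patients, scores loosely aligned with labels.
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0] * 2
scores = [0.8, 0.3, 0.4, 0.6, 0.2, 0.7, 0.1, 0.9, 0.5, 0.3] * 2
print(bootstrap_ci(labels, scores, average_precision, n_boot=500))
```

If two model scales have overlapping intervals on the same test set, a paired bootstrap on the difference in AUPRC is usually a more sensitive comparison than inspecting the two intervals separately.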

Circularity Check

0 steps flagged

No circularity: purely empirical scaling evaluation on held-out tasks

full rationale

The paper reports direct experimental results from pretraining five encoder-only Transformer scales (2.2M–101M parameters) on a sampled Japanese claims database and measuring downstream AUPRC on disease incidence and medication prediction tasks. No mathematical derivations, equations, or predictions are presented that reduce to fitted inputs by construction. Central claims rest on observed task-dependent saturation thresholds and consistent outperformance versus a LightGBM baseline; these are independent empirical observations on held-out data rather than self-definitional, self-cited, or ansatz-smuggled quantities. The study contains no load-bearing self-citations, uniqueness theorems, or renamings of known results.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

Empirical ML study; free parameters are the author-chosen model scales and training setup. Axioms are standard domain assumptions about data representativeness and model suitability. No new entities introduced.

free parameters (1)
  • model parameter scales
    Five discrete scales (2.2M to 101M parameters) selected by authors to probe scaling behavior rather than derived from theory or data.
axioms (2)
  • domain assumption: Encoder-only Transformer is suitable for modeling longitudinal structured medical claims
    Core architecture choice for all pretraining experiments.
  • domain assumption: Random sample of 2.3M patients from 32 hospitals is representative for scaling evaluation
    Basis for all pretraining and downstream results.

pith-pipeline@v0.9.0 · 5579 in / 1513 out tokens · 43553 ms · 2026-05-08T12:25:59.613881+00:00 · methodology

